PROC GLM, II: Two-way Analysis of Variance and Interaction n54703.008 One-way analysis of variance is a fairly straightforward generalization of the t-test. Two-way analysis of variance adds several complications. First, there is the idea that experiments can be designed to estimate and test for the effects of two or more factors simultaneously. This is essentially R. A. Fisher's invention, and it happened in part because he worked at Rothamsted Agricultural Station. The agricultural scientists with whom he worked wanted to study factors which affect the yields of crops (wheat, etc.). Because it takes essentially a whole year to collect data on crop yields, it would be good to be able to study several factors at the same time. Fisher realized that this could be done with some reasonable assumptions. But it gets complicated rather quickly. The second complication is that when you have two factors that can affect a quantitative outcome like crop yield - say, one factor is fertilizer and the other is amount of irrigation - one possibility is that when both are used, their effects simply add together. Another possibility is that they are synergistic - that is, the effect of fertilizer is considerably accelerated by the presence of irrigation. Another possibility is that they interact negatively (antagonistically) - that is, fertilizer alone increases crop yields, and irrigation alone increases crop yields, but the two together actually decrease crop yields for some reason - perhaps by causing the plant to expend its energies into growing leaves and stalks rather than in growing more and larger seeds. Thus in two-way analysis of variance, one wants to estimate the effects of each factor and test for whether they are significant, but also test for the presence of interactions (which be either synergistic or antagonistic) and to estimate the magnitude of the interaction in some way. Here is an example from the Lung Health Study. The outcome or dependent variable is percent weight gain from baseline to Year 5. The two factors of interest are (1) smoking status at Year 5, and (2) gender. Thus the questions of interest are: 1) Is Year 5 smoking status related to percent weight gain ? 2) Does gender affect percent weight gain ? 3) Within smoking status categories, are their differences between the genders in weight gain? Year 5 smoking status is called 'IVPCSUS5' and it has three categories: IVPCSUS5 = 1: sustained quitter IVPCSUS5 = 2: intermittent quitter IVPCSUS5 = 3: continuing smoker Gender, as before, has two categories: GENDER = 0: male GENDER = 1: female * ==================================================================== ; proc means data = smoke ; class gender ivpcsus5 ; var wgtchg05 pwtchg05 ; title1 'PROC MEANS: Weight change, baseline to Year 5' ; title2 'versus gender and Year 5 smoking status' ; format gender gender. ivpcsus5 ivpcsus. ; run ; proc glm data = smoke ; class gender ivpcsus5 ; model pwtchg05 = gender ivpcsus5 / solution ; title1 'Lung Health Study: % weight change, Baseline to Year 5' ; title2 'versus gender and Year 5 smoking status' ; format gender gender. ivpcsus5 ivpcsus. ; run ; proc glm data = smoke ; class gender ivpcsus5 ; model pwtchg05 = gender | ivpcsus5 / solution ; title1 'Lung Health Study: % weight change, Baseline to Year 5' ; title2 'versus gender and Year 5 smoking status, and interaction' ; title3 'of gender and smoking status' ; means gender | ivpcsus5 / bon ; format gender gender. ivpcsus5 ivpcsus. ; run ; endsas ; ======================================================================================================================== PROC MEANS: Weight change, baseline to Year 5 versus gender and Year 5 smoking status 1 19:21 Tuesday, March 2, 2004 GENDER IVPCSUS5 N Obs Variable Label N Mean Std Dev --------------------------------------------------------------------------------------------------------------- MEN SUSTAINED QUIT 616 WGTCHG05 Weight change, Year 5 - Baseline 616 7.5814935 6.9734536 PWTCHG05 Pct weight change, Year 5 - Baseline 616 9.4168640 8.2872195 INTERMITT QUIT 963 WGTCHG05 Weight change, Year 5 - Baseline 935 5.1493048 6.9321518 PWTCHG05 Pct weight change, Year 5 - Baseline 935 6.3261682 8.3515810 CONTIN SMOKER 2022 WGTCHG05 Weight change, Year 5 - Baseline 1870 1.2616043 5.3922831 PWTCHG05 Pct weight change, Year 5 - Baseline 1870 1.5892326 6.5480143 WOMEN SUSTAINED QUIT 321 WGTCHG05 Weight change, Year 5 - Baseline 319 8.6595611 6.8005707 PWTCHG05 Pct weight change, Year 5 - Baseline 319 13.4672099 10.2055760 INTERMITT QUIT 622 WGTCHG05 Weight change, Year 5 - Baseline 612 5.7668301 6.8957798 PWTCHG05 Pct weight change, Year 5 - Baseline 612 8.8775309 10.0542127 CONTIN SMOKER 1194 WGTCHG05 Weight change, Year 5 - Baseline 1117 1.9316920 5.5939001 PWTCHG05 Pct weight change, Year 5 - Baseline 1117 3.0335047 8.5554308 --------------------------------------------------------------------------------------------------------------- GENDER IVPCSUS5 N Obs Variable Label Minimum Maximum --------------------------------------------------------------------------------------------------------- MEN SUSTAINED QUIT 616 WGTCHG05 Weight change, Year 5 - Baseline -30.9000000 30.4000000 PWTCHG05 Pct weight change, Year 5 - Baseline -26.6609146 42.0168067 INTERMITT QUIT 963 WGTCHG05 Weight change, Year 5 - Baseline -19.8000000 51.3000000 PWTCHG05 Pct weight change, Year 5 - Baseline -23.1971154 59.1439689 CONTIN SMOKER 2022 WGTCHG05 Weight change, Year 5 - Baseline -42.7000000 32.4000000 PWTCHG05 Pct weight change, Year 5 - Baseline -38.5031560 62.7906977 WOMEN SUSTAINED QUIT 321 WGTCHG05 Weight change, Year 5 - Baseline -12.2000000 39.5000000 PWTCHG05 Pct weight change, Year 5 - Baseline -16.4864865 57.1635311 INTERMITT QUIT 622 WGTCHG05 Weight change, Year 5 - Baseline -17.5000000 29.4000000 PWTCHG05 Pct weight change, Year 5 - Baseline -21.8750000 45.4400000 CONTIN SMOKER 1194 WGTCHG05 Weight change, Year 5 - Baseline -24.5000000 39.7000000 PWTCHG05 Pct weight change, Year 5 - Baseline -30.7692308 50.7024266 --------------------------------------------------------------------------------------------------------- LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21 -------------------------------------------------------------------------------------------------------------------------- Lung Health Study: % weight change, Baseline to Year 5 2 versus gender and Year 5 smoking status 19:21 Tuesday, March 2, 2004 General Linear Models Procedure Class Level Information Class Levels Values GENDER 2 MEN WOMEN IVPCSUS5 3 CONTIN SMOKER INTERMITT QUIT SUSTAINED QUIT Number of observations in data set = 5887 NOTE: Due to missing values, only 5469 observations can be used in this analysis. LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21 -------------------------------------------------------------------------------------------------------------------------- Lung Health Study: % weight change, Baseline to Year 5 3 versus gender and Year 5 smoking status 19:21 Tuesday, March 2, 2004 General Linear Models Procedure Dependent Variable: PWTCHG05 Pct weight change, Year 5 - Baseline Source DF Sum of Squares Mean Square F Value Pr > F Model 3 70604.24233908 23534.74744636 352.13 0.0001 Error 5465 365254.38689887 66.83520346 Corrected Total 5468 435858.62923795 R-Square C.V. Root MSE PWTCHG05 Mean 0.161989 160.7998 8.17528002 5.08413493 Source DF Type I SS Mean Square F Value Pr > F GENDER 1 5712.48856447 5712.48856447 85.47 0.0001 IVPCSUS5 2 64891.75377461 32445.87688730 485.46 0.0001 Source DF Type III SS Mean Square F Value Pr > F GENDER 1 6149.73514550 6149.73514550 92.01 0.0001 IVPCSUS5 2 64891.75377461 32445.87688730 485.46 0.0001 T for H0: Pr > |T| Std Error of Parameter Estimate Parameter=0 Estimate INTERCEPT 12.24319792 B 39.90 0.0001 0.30684994 GENDER MEN -2.19247064 B -9.59 0.0001 0.22856421 WOMEN 0.00000000 B . . . IVPCSUS5 CONTIN SMOKER -8.74128640 B -28.52 0.0001 0.30645226 INTERMITT QUIT -3.58257980 B -10.57 0.0001 0.33887958 SUSTAINED QUIT 0.00000000 B . . . NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations. Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters. LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21 -------------------------------------------------------------------------------------------------------------------------- Lung Health Study: % weight change, Baseline to Year 5 4 versus gender and Year 5 smoking status, and interaction 19:21 Tuesday, March 2, 2004 of gender and smoking status General Linear Models Procedure Class Level Information Class Levels Values GENDER 2 MEN WOMEN IVPCSUS5 3 CONTIN SMOKER INTERMITT QUIT SUSTAINED QUIT Number of observations in data set = 5887 NOTE: Due to missing values, only 5469 observations can be used in this analysis. LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21 -------------------------------------------------------------------------------------------------------------------------- Lung Health Study: % weight change, Baseline to Year 5 5 versus gender and Year 5 smoking status, and interaction 19:21 Tuesday, March 2, 2004 of gender and smoking status General Linear Models Procedure Dependent Variable: PWTCHG05 Pct weight change, Year 5 - Baseline Source DF Sum of Squares Mean Square F Value Pr > F Model 5 71768.77643291 14353.75528658 215.37 0.0001 Error 5463 364089.85280504 66.64650427 Corrected Total 5468 435858.62923795 R-Square C.V. Root MSE PWTCHG05 Mean 0.164661 160.5727 8.16373103 5.08413493 Source DF Type I SS Mean Square F Value Pr > F GENDER 1 5712.48856447 5712.48856447 85.71 0.0001 IVPCSUS5 2 64891.75377461 32445.87688730 486.84 0.0001 GENDER*IVPCSUS5 2 1164.53409383 582.26704692 8.74 0.0002 Source DF Type III SS Mean Square F Value Pr > F GENDER 1 7280.70647178 7280.70647178 109.24 0.0001 IVPCSUS5 2 64528.56647825 32264.28323912 484.11 0.0001 GENDER*IVPCSUS5 2 1164.53409383 582.26704692 8.74 0.0002 T for H0: Pr > |T| Std Error of Parameter Estimate Parameter=0 Estimate INTERCEPT 13.46720987 B 29.46 0.0001 0.45708119 GENDER MEN -4.05034591 B -7.19 0.0001 0.56313017 WOMEN 0.00000000 B . . . IVPCSUS5 CONTIN SMOKER -10.43370520 B -20.13 0.0001 0.51825557 INTERMITT QUIT -4.58967892 B -8.14 0.0001 0.56375769 SUSTAINED QUIT 0.00000000 B . . . GENDER*IVPCSUS5 MEN CONTIN SMOKER 2.60607382 B 4.06 0.0001 0.64220017 MEN INTERMITT QUIT 1.49898314 B 2.13 0.0336 0.70519131 MEN SUSTAINED QUIT 0.00000000 B . . . WOMEN CONTIN SMOKER 0.00000000 B . . . WOMEN INTERMITT QUIT 0.00000000 B . . . WOMEN SUSTAINED QUIT 0.00000000 B . . . NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations. Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters. LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21 -------------------------------------------------------------------------------------------------------------------------- Lung Health Study: % weight change, Baseline to Year 5 6 versus gender and Year 5 smoking status, and interaction 19:21 Tuesday, March 2, 2004 of gender and smoking status General Linear Models Procedure Bonferroni (Dunn) T tests for variable: PWTCHG05 NOTE: This test controls the type I experimentwise error rate, but generally has a higher type II error rate than REGWQ. Alpha= 0.05 df= 5463 MSE= 66.6465 Critical Value of T= 1.96 Minimum Significant Difference= 0.4471 WARNING: Cell sizes are not equal. Harmonic Mean of cell sizes= 2562.153 Means with the same letter are not significantly different. Bon Grouping Mean N GENDER A 6.4050 2048 WOMEN B 4.2934 3421 MEN LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21 Lung Health Study: % weight change, Baseline to Year 5 7 versus gender and Year 5 smoking status, and interaction 19:21 Tuesday, March 2, 2004 of gender and smoking status General Linear Models Procedure Bonferroni (Dunn) T tests for variable: PWTCHG05 NOTE: This test controls the type I experimentwise error rate but generally has a higher type II error rate than Tukey's for all pairwise comparisons. Alpha= 0.05 Confidence= 0.95 df= 5463 MSE= 66.6465 Critical Value of T= 2.39472 Comparisons significant at the 0.05 level are indicated by '***'. Simultaneous Simultaneous Lower Difference Upper IVPCSUS5 Confidence Between Confidence Comparison Limit Means Limit SUSTAINED QUIT - INTERMITT QUIT 2.6534 3.4632 4.2731 *** SUSTAINED QUIT - CONTIN SMOKER 7.9368 8.6694 9.4020 *** INTERMITT QUIT - SUSTAINED QUIT -4.2731 -3.4632 -2.6534 *** INTERMITT QUIT - CONTIN SMOKER 4.5938 5.2062 5.8186 *** CONTIN SMOKER - SUSTAINED QUIT -9.4020 -8.6694 -7.9368 *** CONTIN SMOKER - INTERMITT QUIT -5.8186 -5.2062 -4.5938 *** Level of Level of -----------PWTCHG05---------- GENDER IVPCSUS5 N Mean SD MEN CONTIN SMOKER 1870 1.5892326 6.5480143 MEN INTERMITT QUIT 935 6.3261682 8.3515810 MEN SUSTAINED QUIT 616 9.4168640 8.2872195 WOMEN CONTIN SMOKER 1117 3.0335047 8.5554308 WOMEN INTERMITT QUIT 612 8.8775309 10.0542127 WOMEN SUSTAINED QUIT 319 13.4672099 10.2055760 LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21 ======================================================================================================================== There are two analyses shown here. The first is a two-way analysis of variance with smoking status and gender as factors, and no interaction. The second is a two-way analysis of variance with smoking status, gender and their interaction included. To really understand what is going on here, you need to know what the underlying model actually is. This is not always made clear in textbooks dealing with the analysis of variance. For Model I (no interaction) the model is: PWTCHG05 = a0 + a1*Male + a2*Cont + a3*Intermitt + error, where Male is an indicator variable (or dummy variable) representing male gender: that is, Male = 1 represents male gender, Male = 0 represents female. Also, Cont is an indicator variable representing Continuing Smokers: Cont = 1 means that the person is a Continuing Smoker, while Cont = 0 means the person is NOT a continuing smoker. Finally, Intermitt is an indicator variable for Intermittent Quitters. Note that female (Women) is not represented explicitly in the model. Note also that Sustained Quitters are not represented explicitly. Both of these categories are, however, represented: if Male = 0, then the person must be female. If Cont = 0 and Intermitt = 0, the person must be a sustained quitter. What this means is that the value of the coefficient a0 represents the mean percent weight change for *female sustained quitters*. If you want to find the predicted percent weight change for *male intermittent quitters*, it would be a0 + a1 + a3: a0 = 12.24 a1 = -2.19 a3 = -3.58 ----------------------- a0 + a1 + a3 = 6.47 Note that the *observed* mean % weight gain in male intermittent quitters is 6.33 (from the proc means printout). Also note that the 'Model' sum of squares for this model is 3. This corresponds to the three non-intercept coefficients a1, a2, and a3. The degrees of freedom of the corrected total sum of squares is 5469 - 1, and the degrees of freedom for the error sum of squares is 5469 - 1 - 3 = 5465. There is no test for interaction with this model. There are however tests for both of the 'main effects': gender and smoking status. These are shown in the printout in the 'Type I SS' and 'Type III SS' sections. For a no-interaction model like this one, the Type I SS and the Type III SS agree. Both indicate highly significant effects of both factors on percent weight change. It may be useful to put the observed and predicted percent weight gains in a 2 x 3 table, as follows: Year 5 Smoking Status ---------------------------------------------- Sust Quit Intermitt Q Contin Smk ---------------------------------------------- | | | | | obs 9.42 | obs 6.33 | obs 1.59 | Men | | | | | pred 10.04 | pred 6.46 | pred 1.31 | | | | | ---------------------------------------------- | | | | | obs 13.47 | obs 8.88 | obs 3.03 | Women | | | | | pred 12.23 | pred 8.65 | pred 3.50 | | | | | ---------------------------------------------- Notice that in this table, there is not perfect agreement between the observed and predicted (from the Model) mean percent weight changes. The predicteds tend to be higher than the observeds for men, but lower than the observeds for women. This is an indication that there is room for improvement with this model; in fact it suggests that there is an interaction of gender with smoking status. This is the motivation for the second model, which does include interaction terms: -------------------------------------------------------------------------------------------------------------------------- proc glm data = smoke ; class gender ivpcsus5 ; model pwtchg05 = gender | ivpcsus5 / solution ; title1 'Lung Health Study: % weight change, Baseline to Year 5' ; title2 'versus gender and Year 5 smoking status, and interaction' ; title3 'of gender and smoking status' ; means gender | ivpcsus5 / bon ; format gender gender. ivpcsus5 ivpcsus. ; run ; -------------------------------------------------------------------------------------------------------------------------- Note that the 'model' statement represents the model in the form: gender | ivpcsus5. This is SAS shorthand for saying: gender ivpcsus5 gender*ivpcsus5, where gender and ivpcsus5 represent *main effects* and gender*ivpcsus5 represents the interaction terms. Again it is a good idea to know what the correct, full expression for this model actually is. It is: PWTCHG05 = a0 + a1*Male + a2*Cont + a3*Intermitt + a4*Male*Cont + a5*Male*Intermitt + error. Notice that there are 6 coefficients in this model: the intercept term plus 5 others. Not coincidentally, this equals the number of cells in the table above. We can compute the predicted means from the coefficients for this model also. For example, for Male Intermittent Quitters, the predicted mean would be a0 + a1 + a3 + a5 = 13.47 - 4.05 - 4.58 + 1.50 = 6.34. which is, also not coincidentally, very close (to within roundoff) of the observed mean value for this category, 6.33. Computing the observed and predicted values for the other cells in the table, we have: Year 5 Smoking Status ---------------------------------------------- Sust Quit Intermitt Q Contin Smk ---------------------------------------------- | | | | | obs 9.42 | obs 6.33 | obs 1.59 | Men | | | | | pred 9.42 | pred 6.34 | pred 1.59 | | | | | ---------------------------------------------- | | | | | obs 13.47 | obs 8.88 | obs 3.03 | Women | | | | | pred 13.47 | pred 8.88 | pred 3.04 | | | | | ---------------------------------------------- This shows that there is a substantial improvement in the agreement of observed and predicted when the interaction terms is included. This is sometimes described as a *saturated* model: the predicted means are equal to the observed means. Given only these two categorical predictors, GENDER and IVPCSUS5, there are no more terms to be added to the model. Note that in the ANOVA table for this model, the degrees of freedom for Model SS is 5: i.e., the number of coefficients in the model minus 1. Note also that for this model, the Type I SS and the Type III SS do not agree. Note that sum of all the Type I squares equals the overall regression sum of squares, but the same is not true for the Type III SS. So what are these Type I and Type II Sums of Squares ? The Type I sum of squares is essentially hierarchical. The first Type I sum of squares is for Gender. That means what the effect of Gender is if Gender is the only term in the model. The second Type I SS is for IVPCSUS5. That represents the *additional* effect that IVPCSUS5 has *after* accounting for the effect of gender. Finally, the third Type I SS is GENDER*IVPCSUS5. This represents the additional effect of the interaction term AFTER both of the main effects, GENDER and IVPCSUS5, are in the model. In the example given above, this sum of squares equals 1164.53. The Type III sum of squares is NOT hierarchical. Each of the Type III sum of squares represents the effect of that term after *all* of the other terms, whether they be main effects or interactions, are in the model. Note that the Type III sums of squares for GENDER and IVPCSUS5 are not the same as the corresponding Type I SS. However, as may be expected, the Type III interaction SS *is* the same as the Type I interaction SS. Some statisticians like Type III sums of squares. Others prefer Type I sums of squares. Those who do NOT like Type III sums of squares say: it makes no sense to estimate a main effect (like gender, e.g.) after you have already entered the interaction term in the model. Those who do NOT like Type I sums of squares say: since they are hierarchical, it matters what order the terms are entered into the regression. That is correct. If you reverse GENDER and IVPCSUS4, you will get different Type I sums of squares for each, although the overall sum stays the same. The Der-Everitt textbook expresses a clear, strong preference for Type I sum of squares. I am in agreement. However, if you want to rely on Type I sums of squares, it is a good idea to try both orders for entering the variables. This means that it is not so easy to describe the results. You have to report and describe one or the other. If there is a 'natural' order to these variables, you should use that. For example, suppose you are mainly interested in the effect of smoking status and you regard gender as a nuisance variable for which you want to 'adjust'. In that case it would make sense to determine the effect of smoking status AFTER you see the effect of gender. Thus the model above, model pwtchg05 = gender ivpcsus5 gender*ivpcsus5 makes sense, because the Type I SS will list the effect of GENDER first, and then will show the effect of IVPCSUS5 after gender is in the model. Finally, how do you report results, when you observe, as in this analysis, both significant main effects and a significant interaction? Most texts in this situation recommend saying that there is an interaction effect, but not claiming that there are main effects. This is somewhat unsatisfactory. The only thing that tells the whole story, in a sense, are the *observed means*. When there is interaction, what can you report, without making some statistical purist angry? Say the two factors are A and B, and A has two levels: A = 1 and A = 2, and B has two levels also, B = 1 and B = 2. Then you can consider whether, when you restrict to people assigned to B1, whether there is a difference between A1 and A2. You could do this as follows: proc glm data = dataset ; where B = 1 ; class A ; model Y = A / solution ; title 'Effect of factor A on outcome Y when B = 1.' ; run ; You would similarly consider A = 1 versus A = 2 when B = 2. And similarly, B = 1 versus B = 2 when A = 1. And finally B = 1 versus B = 2 when A = 2. You would also report the interaction from the analysis proc glm data = dataset ; class A B ; model Y = A B A*B / solution ; title 'Effects of A, B, and their interaction, A*B.' ; run ; If the interaction is not significant, many investigators report only the results of the no-interaction analysis: proc glm data = dataset ; class A B ; model Y = A B / solution ; title 'Effects of A and B - no interaction assumed.' ; run ; Interactions in general are often difficult to interpret. The situation can be considerably more complex than what is seen here with smoking status and gender. One of the two factors has only two levels. In general, each can have several levels. There can be more than two factors. This raises the possibility of multiple two-way interactions and three-way or higher interactions. The number of coefficients for such models can be very large, and the patterns of interactions very difficult to understand. * ==================================================================== ; n54703.008 Last update: March 6, 2004.