PROC GLM, II: Two-way Analysis of Variance and Interaction n54703.008
One-way analysis of variance is a fairly straightforward generalization
of the t-test.
Two-way analysis of variance adds several complications. First, there
is the idea that experiments can be designed to estimate and test for the effects
of two or more factors simultaneously. This is essentially R. A. Fisher's
invention, and it happened in part because he worked at Rothamsted Agricultural
Station. The agricultural scientists with whom he worked wanted to study
factors which affect the yields of crops (wheat, etc.). Because it takes
essentially a whole year to collect data on crop yields, it would be good to
be able to study several factors at the same time. Fisher realized that this
could be done with some reasonable assumptions.
But it gets complicated rather quickly. The second complication is that
when you have two factors that can affect a quantitative outcome like crop
yield - say, one factor is fertilizer and the other is amount of irrigation -
one possibility is that when both are used, their effects simply add together.
Another possibility is that they are synergistic - that is, the effect of
fertilizer is considerably accelerated by the presence of irrigation. Another
possibility is that they interact negatively (antagonistically) - that is,
fertilizer alone increases crop yields, and irrigation alone increases crop
yields, but the two together actually decrease crop yields for some reason -
perhaps by causing the plant to expend its energies into growing leaves and
stalks rather than in growing more and larger seeds.
Thus in two-way analysis of variance, one wants to estimate the effects
of each factor and test for whether they are significant, but also test
for the presence of interactions (which be either synergistic or antagonistic)
and to estimate the magnitude of the interaction in some way.
Here is an example from the Lung Health Study. The outcome or dependent
variable is percent weight gain from baseline to Year 5. The two factors of
interest are (1) smoking status at Year 5, and (2) gender. Thus the questions
of interest are:
1) Is Year 5 smoking status related to percent weight gain ?
2) Does gender affect percent weight gain ?
3) Within smoking status categories, are their differences between the
genders in weight gain?
Year 5 smoking status is called 'IVPCSUS5' and it has three categories:
IVPCSUS5 = 1: sustained quitter
IVPCSUS5 = 2: intermittent quitter
IVPCSUS5 = 3: continuing smoker
Gender, as before, has two categories:
GENDER = 0: male
GENDER = 1: female
* ==================================================================== ;
proc means data = smoke ;
class gender ivpcsus5 ;
var wgtchg05 pwtchg05 ;
title1 'PROC MEANS: Weight change, baseline to Year 5' ;
title2 'versus gender and Year 5 smoking status' ;
format gender gender. ivpcsus5 ivpcsus. ;
run ;
proc glm data = smoke ;
class gender ivpcsus5 ;
model pwtchg05 = gender ivpcsus5 / solution ;
title1 'Lung Health Study: % weight change, Baseline to Year 5' ;
title2 'versus gender and Year 5 smoking status' ;
format gender gender. ivpcsus5 ivpcsus. ;
run ;
proc glm data = smoke ;
class gender ivpcsus5 ;
model pwtchg05 = gender | ivpcsus5 / solution ;
title1 'Lung Health Study: % weight change, Baseline to Year 5' ;
title2 'versus gender and Year 5 smoking status, and interaction' ;
title3 'of gender and smoking status' ;
means gender | ivpcsus5 / bon ;
format gender gender. ivpcsus5 ivpcsus. ;
run ;
endsas ;
========================================================================================================================
PROC MEANS: Weight change, baseline to Year 5
versus gender and Year 5 smoking status 1
19:21 Tuesday, March 2, 2004
GENDER IVPCSUS5 N Obs Variable Label N Mean Std Dev
---------------------------------------------------------------------------------------------------------------
MEN SUSTAINED QUIT 616 WGTCHG05 Weight change, Year 5 - Baseline 616 7.5814935 6.9734536
PWTCHG05 Pct weight change, Year 5 - Baseline 616 9.4168640 8.2872195
INTERMITT QUIT 963 WGTCHG05 Weight change, Year 5 - Baseline 935 5.1493048 6.9321518
PWTCHG05 Pct weight change, Year 5 - Baseline 935 6.3261682 8.3515810
CONTIN SMOKER 2022 WGTCHG05 Weight change, Year 5 - Baseline 1870 1.2616043 5.3922831
PWTCHG05 Pct weight change, Year 5 - Baseline 1870 1.5892326 6.5480143
WOMEN SUSTAINED QUIT 321 WGTCHG05 Weight change, Year 5 - Baseline 319 8.6595611 6.8005707
PWTCHG05 Pct weight change, Year 5 - Baseline 319 13.4672099 10.2055760
INTERMITT QUIT 622 WGTCHG05 Weight change, Year 5 - Baseline 612 5.7668301 6.8957798
PWTCHG05 Pct weight change, Year 5 - Baseline 612 8.8775309 10.0542127
CONTIN SMOKER 1194 WGTCHG05 Weight change, Year 5 - Baseline 1117 1.9316920 5.5939001
PWTCHG05 Pct weight change, Year 5 - Baseline 1117 3.0335047 8.5554308
---------------------------------------------------------------------------------------------------------------
GENDER IVPCSUS5 N Obs Variable Label Minimum Maximum
---------------------------------------------------------------------------------------------------------
MEN SUSTAINED QUIT 616 WGTCHG05 Weight change, Year 5 - Baseline -30.9000000 30.4000000
PWTCHG05 Pct weight change, Year 5 - Baseline -26.6609146 42.0168067
INTERMITT QUIT 963 WGTCHG05 Weight change, Year 5 - Baseline -19.8000000 51.3000000
PWTCHG05 Pct weight change, Year 5 - Baseline -23.1971154 59.1439689
CONTIN SMOKER 2022 WGTCHG05 Weight change, Year 5 - Baseline -42.7000000 32.4000000
PWTCHG05 Pct weight change, Year 5 - Baseline -38.5031560 62.7906977
WOMEN SUSTAINED QUIT 321 WGTCHG05 Weight change, Year 5 - Baseline -12.2000000 39.5000000
PWTCHG05 Pct weight change, Year 5 - Baseline -16.4864865 57.1635311
INTERMITT QUIT 622 WGTCHG05 Weight change, Year 5 - Baseline -17.5000000 29.4000000
PWTCHG05 Pct weight change, Year 5 - Baseline -21.8750000 45.4400000
CONTIN SMOKER 1194 WGTCHG05 Weight change, Year 5 - Baseline -24.5000000 39.7000000
PWTCHG05 Pct weight change, Year 5 - Baseline -30.7692308 50.7024266
---------------------------------------------------------------------------------------------------------
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21
--------------------------------------------------------------------------------------------------------------------------
Lung Health Study: % weight change, Baseline to Year 5 2
versus gender and Year 5 smoking status 19:21 Tuesday, March 2, 2004
General Linear Models Procedure
Class Level Information
Class Levels Values
GENDER 2 MEN WOMEN
IVPCSUS5 3 CONTIN SMOKER INTERMITT QUIT SUSTAINED QUIT
Number of observations in data set = 5887
NOTE: Due to missing values, only 5469 observations can be used in this analysis.
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21
--------------------------------------------------------------------------------------------------------------------------
Lung Health Study: % weight change, Baseline to Year 5 3
versus gender and Year 5 smoking status 19:21 Tuesday, March 2, 2004
General Linear Models Procedure
Dependent Variable: PWTCHG05 Pct weight change, Year 5 - Baseline
Source DF Sum of Squares Mean Square F Value Pr > F
Model 3 70604.24233908 23534.74744636 352.13 0.0001
Error 5465 365254.38689887 66.83520346
Corrected Total 5468 435858.62923795
R-Square C.V. Root MSE PWTCHG05 Mean
0.161989 160.7998 8.17528002 5.08413493
Source DF Type I SS Mean Square F Value Pr > F
GENDER 1 5712.48856447 5712.48856447 85.47 0.0001
IVPCSUS5 2 64891.75377461 32445.87688730 485.46 0.0001
Source DF Type III SS Mean Square F Value Pr > F
GENDER 1 6149.73514550 6149.73514550 92.01 0.0001
IVPCSUS5 2 64891.75377461 32445.87688730 485.46 0.0001
T for H0: Pr > |T| Std Error of
Parameter Estimate Parameter=0 Estimate
INTERCEPT 12.24319792 B 39.90 0.0001 0.30684994
GENDER MEN -2.19247064 B -9.59 0.0001 0.22856421
WOMEN 0.00000000 B . . .
IVPCSUS5 CONTIN SMOKER -8.74128640 B -28.52 0.0001 0.30645226
INTERMITT QUIT -3.58257980 B -10.57 0.0001 0.33887958
SUSTAINED QUIT 0.00000000 B . . .
NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations.
Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters.
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21
--------------------------------------------------------------------------------------------------------------------------
Lung Health Study: % weight change, Baseline to Year 5 4
versus gender and Year 5 smoking status, and interaction 19:21 Tuesday, March 2, 2004
of gender and smoking status
General Linear Models Procedure
Class Level Information
Class Levels Values
GENDER 2 MEN WOMEN
IVPCSUS5 3 CONTIN SMOKER INTERMITT QUIT SUSTAINED QUIT
Number of observations in data set = 5887
NOTE: Due to missing values, only 5469 observations can be used in this analysis.
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21
--------------------------------------------------------------------------------------------------------------------------
Lung Health Study: % weight change, Baseline to Year 5 5
versus gender and Year 5 smoking status, and interaction 19:21 Tuesday, March 2, 2004
of gender and smoking status
General Linear Models Procedure
Dependent Variable: PWTCHG05 Pct weight change, Year 5 - Baseline
Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 71768.77643291 14353.75528658 215.37 0.0001
Error 5463 364089.85280504 66.64650427
Corrected Total 5468 435858.62923795
R-Square C.V. Root MSE PWTCHG05 Mean
0.164661 160.5727 8.16373103 5.08413493
Source DF Type I SS Mean Square F Value Pr > F
GENDER 1 5712.48856447 5712.48856447 85.71 0.0001
IVPCSUS5 2 64891.75377461 32445.87688730 486.84 0.0001
GENDER*IVPCSUS5 2 1164.53409383 582.26704692 8.74 0.0002
Source DF Type III SS Mean Square F Value Pr > F
GENDER 1 7280.70647178 7280.70647178 109.24 0.0001
IVPCSUS5 2 64528.56647825 32264.28323912 484.11 0.0001
GENDER*IVPCSUS5 2 1164.53409383 582.26704692 8.74 0.0002
T for H0: Pr > |T| Std Error of
Parameter Estimate Parameter=0 Estimate
INTERCEPT 13.46720987 B 29.46 0.0001 0.45708119
GENDER MEN -4.05034591 B -7.19 0.0001 0.56313017
WOMEN 0.00000000 B . . .
IVPCSUS5 CONTIN SMOKER -10.43370520 B -20.13 0.0001 0.51825557
INTERMITT QUIT -4.58967892 B -8.14 0.0001 0.56375769
SUSTAINED QUIT 0.00000000 B . . .
GENDER*IVPCSUS5 MEN CONTIN SMOKER
2.60607382 B 4.06 0.0001 0.64220017
MEN INTERMITT QUIT
1.49898314 B 2.13 0.0336 0.70519131
MEN SUSTAINED QUIT
0.00000000 B . . .
WOMEN CONTIN SMOKER
0.00000000 B . . .
WOMEN INTERMITT QUIT
0.00000000 B . . .
WOMEN SUSTAINED QUIT
0.00000000 B . . .
NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations.
Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters.
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21
--------------------------------------------------------------------------------------------------------------------------
Lung Health Study: % weight change, Baseline to Year 5 6
versus gender and Year 5 smoking status, and interaction 19:21 Tuesday, March 2, 2004
of gender and smoking status
General Linear Models Procedure
Bonferroni (Dunn) T tests for variable: PWTCHG05
NOTE: This test controls the type I experimentwise error rate, but generally has a higher
type II error rate than REGWQ.
Alpha= 0.05 df= 5463 MSE= 66.6465
Critical Value of T= 1.96
Minimum Significant Difference= 0.4471
WARNING: Cell sizes are not equal.
Harmonic Mean of cell sizes= 2562.153
Means with the same letter are not significantly different.
Bon Grouping Mean N GENDER
A 6.4050 2048 WOMEN
B 4.2934 3421 MEN
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21
Lung Health Study: % weight change, Baseline to Year 5 7
versus gender and Year 5 smoking status, and interaction 19:21 Tuesday, March 2, 2004
of gender and smoking status
General Linear Models Procedure
Bonferroni (Dunn) T tests for variable: PWTCHG05
NOTE: This test controls the type I experimentwise error rate but generally has a higher
type II error rate than Tukey's for all pairwise comparisons.
Alpha= 0.05 Confidence= 0.95 df= 5463 MSE= 66.6465
Critical Value of T= 2.39472
Comparisons significant at the 0.05 level are indicated by '***'.
Simultaneous Simultaneous
Lower Difference Upper
IVPCSUS5 Confidence Between Confidence
Comparison Limit Means Limit
SUSTAINED QUIT - INTERMITT QUIT 2.6534 3.4632 4.2731 ***
SUSTAINED QUIT - CONTIN SMOKER 7.9368 8.6694 9.4020 ***
INTERMITT QUIT - SUSTAINED QUIT -4.2731 -3.4632 -2.6534 ***
INTERMITT QUIT - CONTIN SMOKER 4.5938 5.2062 5.8186 ***
CONTIN SMOKER - SUSTAINED QUIT -9.4020 -8.6694 -7.9368 ***
CONTIN SMOKER - INTERMITT QUIT -5.8186 -5.2062 -4.5938 ***
Level of Level of -----------PWTCHG05----------
GENDER IVPCSUS5 N Mean SD
MEN CONTIN SMOKER 1870 1.5892326 6.5480143
MEN INTERMITT QUIT 935 6.3261682 8.3515810
MEN SUSTAINED QUIT 616 9.4168640 8.2872195
WOMEN CONTIN SMOKER 1117 3.0335047 8.5554308
WOMEN INTERMITT QUIT 612 8.8775309 10.0542127
WOMEN SUSTAINED QUIT 319 13.4672099 10.2055760
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 02MAR04 19:21
========================================================================================================================
There are two analyses shown here. The first is a two-way analysis of
variance with smoking status and gender as factors, and no interaction. The
second is a two-way analysis of variance with smoking status, gender and their
interaction included.
To really understand what is going on here, you need to know what the
underlying model actually is. This is not always made clear in textbooks dealing
with the analysis of variance. For Model I (no interaction) the model is:
PWTCHG05 = a0 + a1*Male + a2*Cont + a3*Intermitt + error,
where Male is an indicator variable (or dummy variable) representing male
gender: that is, Male = 1 represents male gender, Male = 0 represents female.
Also, Cont is an indicator variable representing Continuing Smokers:
Cont = 1 means that the person is a Continuing Smoker, while Cont = 0 means
the person is NOT a continuing smoker. Finally, Intermitt is an indicator
variable for Intermittent Quitters.
Note that female (Women) is not represented explicitly in the model.
Note also that Sustained Quitters are not represented explicitly.
Both of these categories are, however, represented: if Male = 0, then
the person must be female. If Cont = 0 and Intermitt = 0, the person must
be a sustained quitter. What this means is that the value of the coefficient
a0 represents the mean percent weight change for *female sustained quitters*.
If you want to find the predicted percent weight change for *male intermittent
quitters*, it would be a0 + a1 + a3:
a0 = 12.24
a1 = -2.19
a3 = -3.58
-----------------------
a0 + a1 + a3 = 6.47
Note that the *observed* mean % weight gain in male intermittent quitters is 6.33 (from
the proc means printout).
Also note that the 'Model' sum of squares for this model is 3. This
corresponds to the three non-intercept coefficients a1, a2, and a3. The
degrees of freedom of the corrected total sum of squares is 5469 - 1,
and the degrees of freedom for the error sum of squares is 5469 - 1 - 3 = 5465.
There is no test for interaction with this model. There are however
tests for both of the 'main effects': gender and smoking status. These are
shown in the printout in the 'Type I SS' and 'Type III SS' sections. For
a no-interaction model like this one, the Type I SS and the Type III SS agree.
Both indicate highly significant effects of both factors on percent weight
change.
It may be useful to put the observed and predicted percent weight gains in a
2 x 3 table, as follows:
Year 5 Smoking Status
----------------------------------------------
Sust Quit Intermitt Q Contin Smk
----------------------------------------------
| | | |
| obs 9.42 | obs 6.33 | obs 1.59 |
Men | | | |
| pred 10.04 | pred 6.46 | pred 1.31 |
| | | |
----------------------------------------------
| | | |
| obs 13.47 | obs 8.88 | obs 3.03 |
Women | | | |
| pred 12.23 | pred 8.65 | pred 3.50 |
| | | |
----------------------------------------------
Notice that in this table, there is not perfect agreement between
the observed and predicted (from the Model) mean percent weight changes.
The predicteds tend to be higher than the observeds for men, but lower
than the observeds for women. This is an indication that there is room
for improvement with this model; in fact it suggests that there is an
interaction of gender with smoking status.
This is the motivation for the second model, which does include
interaction terms:
--------------------------------------------------------------------------------------------------------------------------
proc glm data = smoke ;
class gender ivpcsus5 ;
model pwtchg05 = gender | ivpcsus5 / solution ;
title1 'Lung Health Study: % weight change, Baseline to Year 5' ;
title2 'versus gender and Year 5 smoking status, and interaction' ;
title3 'of gender and smoking status' ;
means gender | ivpcsus5 / bon ;
format gender gender. ivpcsus5 ivpcsus. ;
run ;
--------------------------------------------------------------------------------------------------------------------------
Note that the 'model' statement represents the model in the form:
gender | ivpcsus5. This is SAS shorthand for saying:
gender ivpcsus5 gender*ivpcsus5,
where gender and ivpcsus5 represent *main effects* and gender*ivpcsus5
represents the interaction terms.
Again it is a good idea to know what the correct, full expression for this
model actually is. It is:
PWTCHG05 = a0 + a1*Male + a2*Cont + a3*Intermitt
+ a4*Male*Cont + a5*Male*Intermitt + error.
Notice that there are 6 coefficients in this model: the intercept term
plus 5 others. Not coincidentally, this equals the number of cells in the
table above.
We can compute the predicted means from the coefficients for this
model also. For example, for Male Intermittent Quitters, the predicted
mean would be
a0 + a1 + a3 + a5 = 13.47 - 4.05 - 4.58 + 1.50 = 6.34.
which is, also not coincidentally, very close (to within roundoff) of the observed
mean value for this category, 6.33.
Computing the observed and predicted values for the other cells in the
table, we have:
Year 5 Smoking Status
----------------------------------------------
Sust Quit Intermitt Q Contin Smk
----------------------------------------------
| | | |
| obs 9.42 | obs 6.33 | obs 1.59 |
Men | | | |
| pred 9.42 | pred 6.34 | pred 1.59 |
| | | |
----------------------------------------------
| | | |
| obs 13.47 | obs 8.88 | obs 3.03 |
Women | | | |
| pred 13.47 | pred 8.88 | pred 3.04 |
| | | |
----------------------------------------------
This shows that there is a substantial improvement in the agreement
of observed and predicted when the interaction terms is included.
This is sometimes described as a *saturated* model: the predicted
means are equal to the observed means. Given only these two categorical
predictors, GENDER and IVPCSUS5, there are no more terms to be added to
the model.
Note that in the ANOVA table for this model, the degrees of freedom
for Model SS is 5: i.e., the number of coefficients in the model minus 1.
Note also that for this model, the Type I SS and the Type III SS do not agree.
Note that sum of all the Type I squares equals the overall regression sum of
squares, but the same is not true for the Type III SS.
So what are these Type I and Type II Sums of Squares ?
The Type I sum of squares is essentially hierarchical. The first Type I
sum of squares is for Gender. That means what the effect of Gender is if Gender
is the only term in the model. The second Type I SS is for IVPCSUS5. That
represents the *additional* effect that IVPCSUS5 has *after* accounting for the
effect of gender. Finally, the third Type I SS is GENDER*IVPCSUS5. This
represents the additional effect of the interaction term AFTER both of the main effects,
GENDER and IVPCSUS5, are in the model. In the example given above, this sum of
squares equals 1164.53.
The Type III sum of squares is NOT hierarchical. Each of the Type III
sum of squares represents the effect of that term after *all* of the other
terms, whether they be main effects or interactions, are in the model. Note
that the Type III sums of squares for GENDER and IVPCSUS5 are not the same
as the corresponding Type I SS. However, as may be expected, the Type III
interaction SS *is* the same as the Type I interaction SS.
Some statisticians like Type III sums of squares. Others prefer
Type I sums of squares.
Those who do NOT like Type III sums of squares say: it makes no sense
to estimate a main effect (like gender, e.g.) after you have already entered
the interaction term in the model.
Those who do NOT like Type I sums of squares say: since they are
hierarchical, it matters what order the terms are entered into the regression.
That is correct. If you reverse GENDER and IVPCSUS4, you will get different
Type I sums of squares for each, although the overall sum stays the same.
The Der-Everitt textbook expresses a clear, strong preference for Type I
sum of squares. I am in agreement. However, if you want to rely on Type I
sums of squares, it is a good idea to try both orders for entering the variables.
This means that it is not so easy to describe the results. You have to report
and describe one or the other. If there is a 'natural' order to these
variables, you should use that. For example, suppose you are mainly interested
in the effect of smoking status and you regard gender as a nuisance variable
for which you want to 'adjust'. In that case it would make sense to determine
the effect of smoking status AFTER you see the effect of gender. Thus the
model above,
model pwtchg05 = gender ivpcsus5 gender*ivpcsus5
makes sense, because the Type I SS will list the effect of GENDER first,
and then will show the effect of IVPCSUS5 after gender is in the model.
Finally, how do you report results, when you observe, as in this
analysis, both significant main effects and a significant interaction?
Most texts in this situation recommend saying that there is an
interaction effect, but not claiming that there are main effects. This
is somewhat unsatisfactory. The only thing that tells the whole story,
in a sense, are the *observed means*.
When there is interaction, what can you report, without making some
statistical purist angry?
Say the two factors are A and B, and A has two levels: A = 1 and A = 2, and B has
two levels also, B = 1 and B = 2. Then you can consider whether, when you
restrict to people assigned to B1, whether there is a difference between
A1 and A2. You could do this as follows:
proc glm data = dataset ;
where B = 1 ;
class A ;
model Y = A / solution ;
title 'Effect of factor A on outcome Y when B = 1.' ;
run ;
You would similarly consider A = 1 versus A = 2 when B = 2.
And similarly, B = 1 versus B = 2 when A = 1. And finally
B = 1 versus B = 2 when A = 2.
You would also report the interaction from the analysis
proc glm data = dataset ;
class A B ;
model Y = A B A*B / solution ;
title 'Effects of A, B, and their interaction, A*B.' ;
run ;
If the interaction is not significant, many investigators report
only the results of the no-interaction analysis:
proc glm data = dataset ;
class A B ;
model Y = A B / solution ;
title 'Effects of A and B - no interaction assumed.' ;
run ;
Interactions in general are often difficult to interpret. The situation
can be considerably more complex than what is seen here with smoking status
and gender. One of the two factors has only two levels. In general, each
can have several levels. There can be more than two factors. This raises the
possibility of multiple two-way interactions and three-way or higher interactions.
The number of coefficients for such models can be very large, and the patterns
of interactions very difficult to understand.
* ==================================================================== ;
n54703.008 Last update: March 6, 2004.