PROC GLM, I: One-way Analysis of Variance. n54703.007
The 'GLM' in PROC GLM stands for General Linear Model. PROC GLM
can be used for analysis of variance problems, but also for regression
problems and analysis of covariance. The latter is simply a mixture
of analysis of variance and regression.
Analysis of variance is usually thought of in terms of *factors*,
i.e., variables which can take on a small number of discrete values:
gender is such a variable, where perhaps gender = 0 indicates male,
and gender = 1 indicates female. Another such factor is race, which
may be coded as:
race = 1, African
race = 2, European
race = 3, Asian
race = 4, Native American
race = 5, Other
Suppose you wanted to study the relationship between race and
cigarettes per day in smokers. It is actually possible to do
most analysis of variance problems using PROC REG, though it is
somewhat cumbersome to do so. Here is an INCORRECT approach:
------------------------------------------------------------------------
proc reg data = racesmk ;
model cigs = race ;
run ;
------------------------------------------------------------------------
In this analysis, race is entered as a *quantitative* predictor.
There is an implied order: African < European < Asian, etc. There
is no reason to assume such an order. A better approach would be
the following:
------------------------------------------------------------------------
data racesmk ;
infile 'racesmk.dat' ;
input person cigs race ;
african = 0 ; european = 0 ; asian = 0 ; native = 0 ; other = 0 ;
if race eq 1 then african = 1 ;
if race eq 2 then european = 1 ;
if race eq 3 then asian = 1 ;
if race eq 4 then native = 1 ;
if race eq 5 then other = 1 ;
run ;
proc reg data = racesmk ;
model cigs = african european asian native ;
run ;
------------------------------------------------------------------------
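The indicator variables can also be created more compactly. The
following is a minimal sketch, assuming the same racesmk.dat layout as
above. In SAS a comparison such as (race eq 1) evaluates to 1 when it
is true and 0 when it is false, so each indicator can be assigned in a
single statement:

```sas
data racesmk ;
   infile 'racesmk.dat' ;
   input person cigs race ;
   * Each comparison evaluates to 1 if true and 0 if false ;
   african  = (race eq 1) ;
   european = (race eq 2) ;
   asian    = (race eq 3) ;
   native   = (race eq 4) ;
   other    = (race eq 5) ;
run ;
```

As with the IF statements above, a person with a missing value of race
gets 0 for all five indicators, so such records should be excluded
before the regression is run.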
There are two important points to note about this regression.
First, race is represented in the model by *indicator variables*: that
is, african = 1 indicates that the person's race is African.
Second, only four of the five indicator variables are entered into
the regression. The fifth racial category corresponds essentially to
the intercept. This regression will produce a coefficient for
each of the four races entered. The coefficients are related to
the means of the dependent variable as explained below.
The following is a program and printout based on Lung Health Study
data for baseline cigarettes per day (F10CIGS) versus race, using
PROC MEANS, PROC REG, and PROC GLM:
* ==================================================================== ;
AWHITE = 0 ; ABLACK = 0 ; AORIENTL = 0 ; ANATIVE = 0 ;
AOTHER = 0 ; AREFUSES = 0 ;
IF RACE EQ 1 THEN AWHITE = 1 ;
IF RACE EQ 2 THEN ABLACK = 1 ;
IF RACE EQ 3 THEN AORIENTL = 1 ;
IF RACE EQ 4 THEN ANATIVE = 1 ;
IF RACE EQ 5 THEN AOTHER = 1 ;
IF RACE EQ 6 THEN AREFUSES = 1 ;
* ==================================================================== ;
PROC FORMAT ;
VALUE RACE 1 = 'WHITE'
2 = 'BLACK'
3 = 'ORIENTAL'
4 = 'NATIVE AMER'
5 = 'OTHER'
6 = 'REFUSES' ;
* ==================================================================== ;
proc means data = smoke n mean std stderr ;
class race ;
var f10cigs ;
title1 'PROC MEANS: mean values of f10cigs versus race' ;
format race race. ;
proc reg data = smoke ;
where race ne . ;
model f10cigs = awhite ablack aorientl anative ;
title1 'PROC REG: model F10cigs = black oriental native other' ;
proc glm data = smoke ;
where race ne . ;
class race ;
model f10cigs = race / solution ;
format race race. ;
title1 'PROC GLM: model F10cigs = race' ;
endsas ;
* ==================================================================== ;
PROC MEANS: mean values of f10cigs versus race 18:41 Monday, March 6, 2006 1
Analysis Variable : F10CIGS CIGS PER DAY AT SCREEN 1
RACE             N Obs      N          Mean       Std Dev     Std Error
-----------------------------------------------------------------------
1: WHITE          5638   5638    31.5801703    12.8111185     0.1706179
2: BLACK           225    225    23.6711111    10.4841850     0.6989457
3: ORIENTAL          8      8    21.2500000    12.1037184     4.2793107
4: NATIVE AMER       7      7    39.2857143    29.2159448    11.0425892
5: OTHER             9      9    31.4444444    13.0873136     4.3624379
-----------------------------------------------------------------------
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 06MAR06 18:41
PROC REG: model F10cigs = black oriental native other 2
18:41 Monday, March 6, 2006
Model: MODEL1
Dependent Variable: F10CIGS CIGS PER DAY AT SCREEN 1
Analysis of Variance

                        Sum of          Mean
Source       DF        Squares        Square    F Value    Prob>F
Model         4    14787.79995    3696.94999     22.715    0.0001
Error      5882   957310.07605     162.75248
C Total    5886   972097.87600

Root MSE    12.75745     R-square    0.0152
Dep Mean    31.27280     Adj R-sq    0.0145
C.V.        40.79406

Parameter Estimates

                   Parameter      Standard     T for H0:                Variable
Variable    DF      Estimate         Error   Parameter=0   Prob > |T|   Label
INTERCEP     1     31.444444    4.25248265         7.394       0.0001   Intercept
AWHITE       1      0.135726    4.25587544         0.032       0.9746
ABLACK       1     -7.773333    4.33669840        -1.792       0.0731
AORIENTL     1    -10.194444    6.19900544        -1.645       0.1001
ANATIVE      1      7.841270    6.42914945         1.220       0.2226
PROC GLM: model F10cigs = race 18:41 Monday, March 6, 2006 3
General Linear Models Procedure
Class Level Information
Class Levels Values
RACE 5 1: WHITE 2: BLACK 3: ORIENTAL 4: NATIVE AMER 5: OTHER
Number of observations in data set = 5887
PROC GLM: model F10cigs = race 18:41 Monday, March 6, 2006 4
General Linear Models Procedure
Dependent Variable: F10CIGS CIGS PER DAY AT SCREEN 1
Source             DF    Sum of Squares       Mean Square   F Value   Pr > F
Model               4    14787.79994555     3696.94998639     22.72   0.0001
Error            5882   957310.07605241      162.75247808
Corrected Total  5886   972097.87599796

R-Square        C.V.       Root MSE    F10CIGS Mean
0.015212    40.79406    12.75744795     31.27280448

Source             DF         Type I SS       Mean Square   F Value   Pr > F
RACE                4    14787.79994555     3696.94998639     22.72   0.0001

Source             DF       Type III SS       Mean Square   F Value   Pr > F
RACE                4    14787.79994555     3696.94998639     22.72   0.0001

                                       T for H0:   Pr > |T|    Std Error of
Parameter                Estimate    Parameter=0                   Estimate
INTERCEPT             31.44444444 B         7.39     0.0001      4.25248265
RACE 1: WHITE          0.13572583 B         0.03     0.9746      4.25587544
     2: BLACK         -7.77333333 B        -1.79     0.0731      4.33669840
     3: ORIENTAL     -10.19444444 B        -1.64     0.1001      6.19900544
     4: NATIVE AMER    7.84126984 B         1.22     0.2226      6.42914945
     5: OTHER          0.00000000 B            .          .               .
NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations.
Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters.
========================================================================================================================
The PROC REG and PROC GLM analyses above are one-way analyses of
variance. Note that a key difference is that in PROC GLM, the variable
'race' is entered as a CLASS variable. What that means is that SAS
will create individual indicator variables for each level of 'race'.
There are 5 levels (no one answered 'refused'). Thus PROC GLM does
the same thing as PROC REG, but it saves you some work: it creates
the indicator variables automatically.
Let's compare the printout from PROC MEANS, PROC REG, and PROC GLM.
Note the following for F10CIGS:
PROC MEANS:  race = WHITE    mean F10CIGS = 31.58
             race = BLACK    mean F10CIGS = 23.67
PROC REG  :  Intercept       F10CIGS = 31.4444
             WHITE coeff     F10CIGS =  0.1357
             BLACK coeff     F10CIGS = -7.7733
PROC GLM  :  Intercept       F10CIGS = 31.4444
             WHITE coeff     F10CIGS =  0.1357
             BLACK coeff     F10CIGS = -7.7733
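You can tell what is going on here. PROC REG prints a coefficient for
each of the four indicator variables entered. The intercept, 31.4444,
is the mean for the omitted category (OTHER), and the sum of the
intercept and a coefficient equals the mean for that race: for
example, 31.4444 + 0.1357 = 31.5801 for WHITE, and 31.4444 - 7.7733 =
23.6711 for BLACK. PROC GLM does the same thing, with two differences.
First, it orders the levels of the CLASS variable alphabetically by
their value-labels. Second, it sets the coefficient for the last level
in that order to zero, so that level becomes the default (reference)
category. With the labels defined in the PROC FORMAT above, WHITE
would come last and would be the default category. The printout,
however, shows numbered labels ('1: WHITE', ..., '5: OTHER'), so OTHER
comes last and is the default category, exactly as in PROC REG. Thus
in PROC GLM, just as in PROC REG, you can find the mean for a category
by adding the coefficient for the category to the intercept term.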
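The relationship between the coefficients and the means can be checked
by hand. The following is a small sketch that simply adds each
coefficient from the PROC REG printout to the intercept. The variable
names are made up for this illustration and the numbers are typed in
from the printout.

```sas
data _null_ ;
   * Intercept = mean of the omitted category, OTHER ;
   intercept = 31.444444 ;
   white     = intercept + 0.135726 ;
   black     = intercept - 7.773333 ;
   oriental  = intercept - 10.194444 ;
   native    = intercept + 7.841270 ;
   * Write the four sums to the SAS log ;
   put white= 9.4 black= 9.4 oriental= 9.4 native= 9.4 ;
run ;
```

The four sums are 31.5802, 23.6711, 21.2500, and 39.2857, which agree
with the means printed by PROC MEANS.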
Now: you may want the default category to be different from the one
that PROC GLM chooses automatically. You can accomplish that by
using a numeric code for the category such that the highest value
corresponds to what you want as the default category (this works when
no format is attached to the variable). Or, in the FORMAT section, you
can ensure that the categories are ordered as you want as follows:
* ==================================================================== ;
PROC FORMAT ;
VALUE RACE 1 = '1-WHITE'
2 = '2-BLACK'
3 = '3-ORIENTAL'
4 = '4-NATIVE AMER'
5 = '5-OTHER'
6 = '6-REFUSES' ;
* ==================================================================== ;
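Two other ways of controlling the default category are sketched below.
Both are offered as possibilities, not as what was done for the
printout above. The first uses the ORDER=INTERNAL option, so that the
levels are ordered by the numeric codes and not by the labels. The
second assumes a SAS release in which the CLASS statement of PROC GLM
accepts a REF= option; that option may not be available in older
releases. The label '1-WHITE' assumes the format just shown is in use.

```sas
* Option 1: order the levels by the numeric codes, not the labels ;
proc glm data = smoke order = internal ;
   where race ne . ;
   class race ;
   model f10cigs = race / solution ;
   format race race. ;
run ;

* Option 2: name the reference level directly ;
* This assumes a release whose CLASS statement accepts REF= ;
proc glm data = smoke ;
   where race ne . ;
   class race (ref = '1-WHITE') ;
   model f10cigs = race / solution ;
   format race race. ;
run ;
```

With Option 1 the default category is the level with the highest code
present in the data (here 5, OTHER). With Option 2 it is WHITE, so each
coefficient is the difference between that race and WHITE.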
One of the purposes in analysis of variance is to see whether means
for the different categories differ significantly. PROC GLM includes
tests for whether there are overall differences in the means, in the
ANOVA table. Look at the printout and answer the following:
Do the categories differ significantly on
F10CIGS ? p = ?
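Note that in this one-way setting PROC REG and PROC GLM give the same
results. The F-test in the PROC REG analysis of variance table
(F = 22.715) is the same test for group differences as the F-test for
RACE in PROC GLM (F = 22.72), and with the SOLUTION option PROC GLM
prints the same coefficients and standard errors as PROC REG. The
advantages of PROC GLM are practical ones: the CLASS statement creates
the indicator variables for you, and the MEANS statement provides
multiple-comparisons tests.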
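A test for group differences can also be requested explicitly in
PROC REG with the TEST statement. The following is a minimal sketch
using the indicator variables defined earlier; the label ALLRACES is
an arbitrary name chosen for this illustration.

```sas
proc reg data = smoke ;
   where race ne . ;
   model f10cigs = awhite ablack aorientl anative ;
   * Joint test that all four indicator coefficients are zero ;
   allraces: test awhite = 0, ablack = 0, aorientl = 0, anative = 0 ;
run ;
```

Because the model contains only the four indicators, this joint test
is the same as the overall F-test in the analysis of variance table.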
PROC GLM can also carry out multiple-comparisons tests. As noted
above, there is a difference between the races in the mean number of
cigs per day. This is shown by the F-test. However, all this test
tells you is that you can reject the hypothesis that all the means are
the same. It does not tell you which means are different. Since there
are 5 race-groups, there are 10 possible different pairs of races
which might be compared. Your chance of seeing a significant difference
between at least two of the groups, given that there are 10 comparisons,
is considerably higher than 0.05 unless you make some kind of adjustment
for the fact that you are doing 10 comparisons. One way of making
such an adjustment is to use the Bonferroni procedure. This can be done
in PROC GLM as follows:
========================================================================================================================
proc glm data = smoke ;
class race ;
model f10cigs = race / solution ;
means race / bon ;
title1 'PROC GLM: Baseline cigs/day versus race ...' ;
title2 'Bonferroni Multiple Comparisons Test included' ;
format race race. ;
run ;
========================================================================================================================
PROC GLM: Baseline cigs versus race ... 18:59 Monday, March 6, 2006 5
Bonferroni Multiple Comparisons Test included.
(Pages 5 and 6 of this printout repeat the Class Level Information,
analysis of variance table, and parameter estimates shown earlier,
and are omitted here.)
PROC GLM: Baseline cigs versus race ... 18:59 Monday, March 6, 2006 7
Bonferroni Multiple Comparisons Test included.
General Linear Models Procedure
Bonferroni (Dunn) T tests for variable: F10CIGS
NOTE: This test controls the type I experimentwise error rate but generally has a higher
type II error rate than Tukey's for all pairwise comparisons.
Alpha= 0.05 Confidence= 0.95 df= 5882 MSE= 162.7525
Critical Value of T= 2.80809
Comparisons significant at the 0.05 level are indicated by '***'.
                                 Simultaneous              Simultaneous
                                        Lower  Difference         Upper
RACE                               Confidence     Between    Confidence
Comparison                              Limit       Means         Limit

4: NATIVE AMER - 1: WHITE             -5.8431      7.7055       21.2542
4: NATIVE AMER - 5: OTHER            -10.2124      7.8413       25.8949
4: NATIVE AMER - 2: BLACK              1.8654     15.6146       29.3639  ***
4: NATIVE AMER - 3: ORIENTAL          -0.5050     18.0357       36.5765

1: WHITE       - 4: NATIVE AMER      -21.2542     -7.7055        5.8431
1: WHITE       - 5: OTHER            -11.8152      0.1357       12.0866
1: WHITE       - 2: BLACK              5.4736      7.9091       10.3445  ***
1: WHITE       - 3: ORIENTAL          -2.3445     10.3302       23.0049

5: OTHER       - 4: NATIVE AMER      -25.8949     -7.8413       10.2124
5: OTHER       - 1: WHITE            -12.0866     -0.1357       11.8152
5: OTHER       - 2: BLACK             -4.4045      7.7733       19.9512
5: OTHER       - 3: ORIENTAL          -7.2129     10.1944       27.6018

2: BLACK       - 4: NATIVE AMER      -29.3639    -15.6146       -1.8654  ***
2: BLACK       - 1: WHITE            -10.3445     -7.9091       -5.4736  ***
2: BLACK       - 5: OTHER            -19.9512     -7.7733        4.4045
2: BLACK       - 3: ORIENTAL         -10.4678      2.4211       15.3100

3: ORIENTAL    - 4: NATIVE AMER      -36.5765    -18.0357        0.5050
3: ORIENTAL    - 1: WHITE            -23.0049    -10.3302        2.3445
3: ORIENTAL    - 5: OTHER            -27.6018    -10.1944        7.2129
3: ORIENTAL    - 2: BLACK            -15.3100     -2.4211       10.4678
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 06MAR06 18:59
========================================================================================================================
Note that the table above indicates that only two of the 10 pairs of
races being compared differ significantly in mean values of
cigarettes per day: Native American versus Black, and White versus
Black. Each pair appears twice in the table, once in each direction.
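The numbers in the Bonferroni table can be checked by hand. The
following sketch assumes the usual Bonferroni construction: a
two-sided t critical value at level 0.05/10, applied to the pooled
MSE and the two group sizes. The variable names are made up for this
illustration, and the MSE, group sizes, and difference in means are
typed in from the printouts.

```sas
data _null_ ;
   k      = 5 ;                      * number of race groups ;
   npairs = k * (k - 1) / 2 ;        * 10 pairwise comparisons ;
   alpha  = 0.05 ;
   * Two-sided critical value at level alpha / npairs, 5882 error df ;
   tcrit  = tinv(1 - alpha / (2 * npairs), 5882) ;
   * Half-width of the WHITE - BLACK interval, using MSE = 162.7525 ;
   half   = tcrit * sqrt(162.7525 * (1/5638 + 1/225)) ;
   lower  = 7.9091 - half ;
   upper  = 7.9091 + half ;
   put npairs= tcrit= 8.5 lower= 8.4 upper= 8.4 ;
run ;
```

The critical value should agree with the 2.80809 shown in the
printout, and the limits should agree, up to rounding, with the
5.4736 and 10.3445 printed for WHITE versus BLACK.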
* ==================================================================== ;
PROBLEM 1.
Refer to the data in Chapter 4 on crime rates.
Create categorical variables which represent tertiles of the
variables EX1 and W. That is, create two new variables EX1TERT
and WTERT. For example, for EX1TERT, sort the observed values of
EX1 into low, middle and high groups (of size 16, 16, and 15),
and define EX1TERT = 1 if the observation is in the low group,
EX1TERT = 2 if the observation is in the middle group, etc. Do the
same for WTERT.
Find means and standard deviations for the crime rate R in each
of the tertiles of EX1 and W. Find 95% confidence intervals for each
of these means.
Now use PROC GLM to carry out analyses of variance of the
outcome variable R versus EX1TERT and WTERT (separate analyses).
State the conclusions from your analysis.
Use the MEANS statement in PROC GLM with the BON (Bonferroni) option
to determine whether the tertiles of EX1 and W have significantly
different values of R. Again describe your conclusions.
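One possible way to start on the tertile variables is sketched below,
using PROC RANK. The data set names CRIME and CRIME2 are hypothetical;
substitute the name of your own copy of the Chapter 4 data. Note that
the group sizes produced by PROC RANK may not be exactly 16, 16, and
15, especially if there are tied values, so check them before going on.

```sas
* CRIME is a hypothetical name for the Chapter 4 data set ;
proc rank data = crime out = crime2 groups = 3 ;
   var ex1 w ;
   ranks ex1tert wtert ;          * groups are numbered 0, 1, 2 ;
run ;

data crime2 ;
   set crime2 ;
   ex1tert = ex1tert + 1 ;        * recode to 1, 2, 3 ;
   wtert   = wtert + 1 ;
run ;

* Means, standard deviations, and 95 percent confidence limits ;
proc means data = crime2 n mean std clm ;
   class ex1tert ;
   var r ;
run ;
```

The PROC MEANS step would be repeated with WTERT as the CLASS variable.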
* ==================================================================== ;
n54703.007 Last update: March 6, 2006.