PROC GLM, I:  One-way Analysis of Variance.               n54703.007


     The 'GLM' in PROC GLM stands for General Linear Model.  PROC GLM
can be used for analysis of variance problems, but also for regression
problems and analysis of covariance.  The latter is simply a mixture
of analysis of variance and regression.

     Analysis of variance is usually thought of in terms of *factors*,
i.e., variables which can take on a small number of discrete values.
Gender is such a variable, where perhaps gender = 0 indicates male,
and gender = 1 indicates female.  Another such factor is race, which
may be coded as:

     race = 1, African
     race = 2, European
     race = 3, Asian
     race = 4, Native American
     race = 5, Other

     Suppose you wanted to study the relationship between race and
cigarettes per day in smokers.  It is actually possible to do
most analysis of variance problems using PROC REG, though it is
somewhat cumbersome to do so.  Here is an INCORRECT approach:

------------------------------------------------------------------------
     proc reg data = racesmk ;
          model cigs = race ;
     run ;

------------------------------------------------------------------------

     In this analysis, race is entered as a *quantitative* predictor.
There is an implied order: African < European < Asian, etc.  There
is no reason to assume such an order, or to assume that the
categories are equally spaced.  A better approach would be
the following:

    data racesmk ;
         infile 'racesmk.dat' ;
         input person cigs race ;

         african = 0 ; european = 0 ; asian = 0 ; native = 0 ; other = 0 ;
         if race eq 1 then african = 1 ;
         if race eq 2 then european = 1 ;
         if race eq 3 then asian = 1 ;
         if race eq 4 then native = 1 ;
         if race eq 5 then other = 1 ;

    run ;

    proc reg data = racesmk ;
         model cigs = african european asian native ;
    run ;

------------------------------------------------------------------------

    There are two important points to note about this regression.
First, race is represented in the model by *indicator variables*: that
is, african = 1 indicates that the person's race is African.
Second, only four of the five indicator variables are entered into
the regression.  The fifth racial category corresponds essentially to
the intercept.  This regression will produce a coefficient for
each of the four races entered.  The coefficients are related to
the means of the dependent variable as explained below.
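
     The indicator variables used above can also be built more
compactly.  The following is only a sketch, assuming the same raw file
layout as above.  In SAS a comparison such as (race eq 1) evaluates
to 1 when true and 0 when false.  The IF ... THEN DO block is an
addition that is not in the program above: it leaves the indicators
missing when race is missing, so that PROC REG drops such persons
rather than counting them in the 'other' category.

* ==================================================================== ;

    data racesmk ;
         infile 'racesmk.dat' ;
         input person cigs race ;

         /* indicators stay missing when race is missing */
         if race ne . then do ;
              african  = (race eq 1) ;
              european = (race eq 2) ;
              asian    = (race eq 3) ;
              native   = (race eq 4) ;
              other    = (race eq 5) ;
         end ;
    run ;

* ==================================================================== ;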

     The following is a program and printout based on Lung Health Study
data for baseline cigarettes per day (F10CIGS) versus race, using
PROC MEANS, PROC REG and PROC GLM:

* ==================================================================== ;        


 AWHITE = 0 ; ABLACK = 0 ; AORIENTL = 0 ; ANATIVE = 0 ;
 AOTHER = 0 ; AREFUSES = 0 ;
 IF RACE EQ 1 THEN AWHITE = 1 ;
 IF RACE EQ 2 THEN ABLACK = 1 ;
 IF RACE EQ 3 THEN AORIENTL = 1 ;
 IF RACE EQ 4 THEN ANATIVE = 1 ;
 IF RACE EQ 5 THEN AOTHER = 1 ;
 IF RACE EQ 6 THEN AREFUSES = 1 ;

* ==================================================================== ;        

  PROC FORMAT ;                                                                 

       VALUE RACE     1 = 'WHITE'
                      2 = 'BLACK'
                      3 = 'ORIENTAL'
                      4 = 'NATIVE AMER'
                      5 = 'OTHER'
                      6 = 'REFUSES' ;

* ==================================================================== ;        

proc means data = smoke n mean std stderr ;
     class race ;
     var f10cigs ;
title1 'PROC MEANS: mean values of f10cigs versus race' ;
format race race. ;

proc reg data = smoke ;
     where race ne . ;
     model f10cigs = awhite ablack aorientl anative ;
title1 'PROC REG: model F10cigs = black oriental native other' ;

proc glm data = smoke ;
     where race ne . ;
     class race ;
     model f10cigs = race / solution ;
title1 'PROC GLM: model F10cigs = race' ;
format race race. ;

endsas ;

* ==================================================================== ;        

                                     PROC MEANS: mean values of f10cigs versus race      18:41 Monday, March 6, 2006   1

                         Analysis Variable : F10CIGS CIGS PER DAY AT SCREEN 1


                                   RACE  N Obs     N          Mean       Std Dev     Std Error
                         ---------------------------------------------------------------------
                         1: WHITE         5638  5638    31.5801703    12.8111185     0.1706179

                         2: BLACK          225   225    23.6711111    10.4841850     0.6989457

                         3: ORIENTAL         8     8    21.2500000    12.1037184     4.2793107

                         4: NATIVE AMER      7     7    39.2857143    29.2159448    11.0425892

                         5: OTHER            9     9    31.4444444    13.0873136     4.3624379
                         ---------------------------------------------------------------------
 
 
 
                                  LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 06MAR06 18:41
                                  PROC REG: model F10cigs = black oriental native other                                 2
                                                                                             18:41 Monday, March 6, 2006

Model: MODEL1  
Dependent Variable: F10CIGS    CIGS PER DAY AT SCREEN 1                

                                                  Analysis of Variance

                                                     Sum of         Mean
                            Source          DF      Squares       Square      F Value       Prob>F

                            Model            4  14787.79995   3696.94999       22.715       0.0001
                            Error         5882 957310.07605    162.75248
                            C Total       5886 972097.87600

                                Root MSE      12.75745     R-square       0.0152
                                Dep Mean      31.27280     Adj R-sq       0.0145
                                C.V.          40.79406

                                                  Parameter Estimates

                      Parameter      Standard    T for H0:                 Variable
     Variable  DF      Estimate         Error   Parameter=0    Prob > |T|     Label

     INTERCEP   1     31.444444    4.25248265         7.394        0.0001  Intercept                               
     AWHITE     1      0.135726    4.25587544         0.032        0.9746                                          
     ABLACK     1     -7.773333    4.33669840        -1.792        0.0731                                          
     AORIENTL   1    -10.194444    6.19900544        -1.645        0.1001                                          
     ANATIVE    1      7.841270    6.42914945         1.220        0.2226                                          

 
 
                                  LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 06MAR06 18:41
                                              PROC GLM: model F10cigs = race              18:41 Monday, March 6, 2006   3

                                            General Linear Models Procedure
                                                Class Level Information

                        Class    Levels    Values

                        RACE          5    1: WHITE 2: BLACK 3: ORIENTAL 4: NATIVE AMER 5: OTHER


                                       Number of observations in data set = 5887

 
 
                                  LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 06MAR06 18:41
                                              PROC GLM: model F10cigs = race              18:41 Monday, March 6, 2006   4

                                            General Linear Models Procedure

   Dependent Variable: F10CIGS   CIGS PER DAY AT SCREEN 1

   Source                  DF             Sum of Squares               Mean Square          F Value            Pr > F

   Model                    4             14787.79994555             3696.94998639            22.72            0.0001

   Error                 5882            957310.07605241              162.75247808

   Corrected Total       5886            972097.87599796

                     R-Square                       C.V.                  Root MSE                       F10CIGS Mean

                     0.015212                   40.79406               12.75744795                        31.27280448


   Source                  DF                  Type I SS               Mean Square          F Value            Pr > F

   RACE                     4             14787.79994555             3696.94998639            22.72            0.0001

   Source                  DF                Type III SS               Mean Square          F Value            Pr > F

   RACE                     4             14787.79994555             3696.94998639            22.72            0.0001


                                                             T for H0:             Pr > |T|            Std Error of
   Parameter                           Estimate             Parameter=0                                  Estimate

   INTERCEPT                        31.44444444 B                  7.39              0.0001              4.25248265
   RACE      1: WHITE                0.13572583 B                  0.03              0.9746              4.25587544
             2: BLACK               -7.77333333 B                 -1.79              0.0731              4.33669840
             3: ORIENTAL           -10.19444444 B                 -1.64              0.1001              6.19900544
             4: NATIVE AMER          7.84126984 B                  1.22              0.2226              6.42914945
             5: OTHER                0.00000000 B                   .                 .                   .        

NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations.   
      Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters.


 
 
========================================================================================================================

     The PROC REG and PROC GLM analyses above are one-way analyses of
variance.  Note that a key difference is that in PROC GLM, the variable
'race' is entered as a CLASS variable.  What that means is that SAS
will create individual indicator variables for each level of 'race'.
There are 5 levels (no one answered 'refused').   Thus PROC GLM does
the same thing as PROC REG, but it saves you some work: it creates
the indicator variables automatically.
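
     For example, the race-and-smoking regression from the beginning
of this section could be run in PROC GLM without any hand-made
indicator variables.  This is a sketch using the illustrative data set
racesmk from above.  With no format attached to race, the levels are
ordered by their numeric codes, so race = 5 (Other) is the default
category, which matches the PROC REG model that left out 'other'.

* ==================================================================== ;

    proc glm data = racesmk ;
         where race ne . ;
         class race ;
         /* SOLUTION asks for the coefficients and standard errors */
         model cigs = race / solution ;
    run ;

* ==================================================================== ;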

     Let's compare the printout from PROC MEANS, PROC REG, and PROC GLM.
Note the following for F10CIGS:

     PROC MEANS:  race = WHITE mean F10CIGS = 31.58
                  race = BLACK mean F10CIGS = 23.67

     PROC REG  :  Intercept         F10CIGS = 31.4444
                  WHITE coeff       F10CIGS =  0.1357
                  BLACK coeff       F10CIGS = -7.7733

     PROC GLM  :  Intercept         F10CIGS = 31.4444
                  WHITE coeff       F10CIGS =  0.1357
                  BLACK coeff       F10CIGS = -7.7733

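     You can tell what is going on here.  In PROC REG, the sum of the
intercept and the coefficient for a race equals the mean for that
race; for example, 31.4444 + 0.1357 = 31.5801 for WHITE.  The intercept
by itself is the mean for OTHER, the one category whose indicator
variable was left out of the model.  PROC GLM prints exactly the same
estimates, but it arrives at them automatically.  First, it orders the
levels of the CLASS variable by their formatted values (value-labels).
In this printout the labels are '1: WHITE', ..., '5: OTHER', so OTHER
comes last on the list.  Second, it treats the last level as the
default (reference) category: its coefficient is set to 0, so in
effect only 4 of the 5 indicator variables enter the analysis.  (Had
the labels been plain 'WHITE', 'BLACK', etc., WHITE would sort last
alphabetically and would be the default category.)  Thus in PROC GLM,
just as in PROC REG, you can find the mean for a category by adding
the coefficient for the category to the intercept term.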
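
     The relation between the coefficients and the means can be
checked directly.  The following sketch types in the estimates from
the printout above and adds the intercept to each coefficient; the
data set name CHKMEANS and its variable names are made up for this
illustration.

* ==================================================================== ;

    data chkmeans ;
         length group $ 12 ;
         input group $ coeff ;
         intercept = 31.444444 ;        /* mean for OTHER       */
         mean = intercept + coeff ;     /* mean for this group  */
         datalines ;
    WHITE        0.135726
    BLACK       -7.773333
    ORIENTAL   -10.194444
    NATIVE       7.841270
    OTHER        0.000000
    ;
    run ;

    proc print data = chkmeans ;
         var group intercept coeff mean ;
    run ;

* ==================================================================== ;

     The values of MEAN should agree with the PROC MEANS printout:
31.58 for WHITE, 23.67 for BLACK, 21.25 for ORIENTAL, 39.29 for
NATIVE AMER, and 31.44 for OTHER.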

     Now: you may want the default category to be different from the
one that PROC GLM chooses automatically.  If the CLASS variable has
no format, the levels are ordered by their numeric codes, so you can
use a coding in which the highest value corresponds to what you want
as the default category.  Or, in the FORMAT section, you can prefix
the labels so that they sort in the order you want, with the desired
default category last, as follows:

* ==================================================================== ;        

  PROC FORMAT ;                                                                 

       VALUE RACE     1 = '1-WHITE'
                      2 = '2-BLACK'
                      3 = '3-ORIENTAL'
                      4 = '4-NATIVE AMER'
                      5 = '5-OTHER'
                      6 = '6-REFUSES' ;

* ==================================================================== ;        
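
     For instance, to make WHITE the default category in the analysis
above, one could give WHITE the label that sorts last.  This is only a
sketch; the format name RACEW and its labels are made up for this
illustration.

* ==================================================================== ;

  PROC FORMAT ;

       VALUE RACEW    6 = '0-REFUSES'
                      2 = '1-BLACK'
                      3 = '2-ORIENTAL'
                      4 = '3-NATIVE AMER'
                      5 = '4-OTHER'
                      1 = '5-WHITE' ;   /* sorts last: default category */
  run ;

proc glm data = smoke ;
     where race ne . ;
     class race ;
     model f10cigs = race / solution ;
     format race racew. ;
run ;

* ==================================================================== ;

     With this format the intercept would be the mean for WHITE, and
each coefficient would be the difference between a race's mean and
the WHITE mean.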

     One of the purposes in analysis of variance is to see whether means
for the different categories differ significantly.  PROC GLM includes
tests for whether there are overall differences in the means, in the
ANOVA table.  Look at the printout and answer the following:

     Do the categories differ significantly on

             F10CIGS ?      p = ?

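     Note that in this one-way problem PROC REG and PROC GLM give the
same results: both print the overall F-test for group differences in
the ANOVA table, and both print the coefficients with their standard
errors.  The differences are practical ones.  PROC REG requires you to
create the indicator variables yourself.  PROC GLM creates them for
you from the CLASS statement, but it prints the coefficients and
standard errors only if you add the SOLUTION option to the MODEL
statement.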

     PROC GLM can also carry out multiple-comparisons tests.  As noted
above, there is a difference between the races in the mean number of
cigs per day.  This is shown by the F-test.  However, all this test
tells you is that you can reject the hypothesis that all the means are
the same.  It does not tell you which means are different.  Since there
are 5 race-groups, there are 10 possible different pairs of races
which might be compared.  Your chance of seeing at least one
significant difference among the 10 comparisons, even if all the
means are truly equal, is considerably higher than 0.05 unless you
make some kind of adjustment for the fact that you are doing 10
comparisons.  One way of making such an adjustment is to use the
Bonferroni procedure.  This can be done in PROC GLM as follows:

========================================================================================================================

proc glm data = smoke ;
     class race ;
     model f10cigs = race / solution ;
     means race / bon ;
title1 'PROC GLM: Baseline cigs/day versus race ...' ;
title2 'Bonferroni Multiple Comparisons Test included' ;
format race race. ;
run ;

========================================================================================================================

                                        PROC GLM: Baseline cigs versus race ...          18:59 Monday, March 6, 2006   5
                                     Bonferroni Multiple Comparisons Test included.

                                            General Linear Models Procedure
                                                Class Level Information

                        Class    Levels    Values

                        RACE          5    1: WHITE 2: BLACK 3: ORIENTAL 4: NATIVE AMER 5: OTHER


                                       Number of observations in data set = 5887


 
 
                                  LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 06MAR06 18:59
                                        PROC GLM: Baseline cigs versus race ...          18:59 Monday, March 6, 2006   6
                                     Bonferroni Multiple Comparisons Test included.

                                            General Linear Models Procedure

   Dependent Variable: F10CIGS   CIGS PER DAY AT SCREEN 1

   Source                  DF             Sum of Squares               Mean Square          F Value            Pr > F

   Model                    4             14787.79994555             3696.94998639            22.72            0.0001

   Error                 5882            957310.07605241              162.75247808

   Corrected Total       5886            972097.87599796

                     R-Square                       C.V.                  Root MSE                       F10CIGS Mean

                     0.015212                   40.79406               12.75744795                        31.27280448


   Source                  DF                  Type I SS               Mean Square          F Value            Pr > F

   RACE                     4             14787.79994555             3696.94998639            22.72            0.0001

   Source                  DF                Type III SS               Mean Square          F Value            Pr > F

   RACE                     4             14787.79994555             3696.94998639            22.72            0.0001


                                                             T for H0:             Pr > |T|            Std Error of
   Parameter                           Estimate             Parameter=0                                  Estimate

   INTERCEPT                        31.44444444 B                  7.39              0.0001              4.25248265
   RACE      1: WHITE                0.13572583 B                  0.03              0.9746              4.25587544
             2: BLACK               -7.77333333 B                 -1.79              0.0731              4.33669840
             3: ORIENTAL           -10.19444444 B                 -1.64              0.1001              6.19900544
             4: NATIVE AMER          7.84126984 B                  1.22              0.2226              6.42914945
             5: OTHER                0.00000000 B                   .                 .                   .        

NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations.   
      Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters.

 
 
                                  LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 06MAR06 18:59
                                         PROC GLM: Baseline cigs versus race ...          18:59 Monday, March 6, 2006   7
                                     Bonferroni Multiple Comparisons Test included.

                                            General Linear Models Procedure

                                    Bonferroni (Dunn) T tests for variable: F10CIGS

                              NOTE: This test controls the type I experimentwise error rate but generally has a higher 
                                    type II error rate than Tukey's for all pairwise comparisons.

                                 Alpha= 0.05  Confidence= 0.95  df= 5882  MSE= 162.7525
                                              Critical Value of T= 2.80809

                           Comparisons significant at the 0.05 level are indicated by '***'.

                                                          Simultaneous            Simultaneous
                                                              Lower    Difference     Upper
                                     RACE                  Confidence    Between   Confidence
                                  Comparison                  Limit       Means       Limit

                       4: NATIVE AMER - 1: WHITE            -5.8431      7.7055     21.2542
                       4: NATIVE AMER - 5: OTHER           -10.2124      7.8413     25.8949
                       4: NATIVE AMER - 2: BLACK             1.8654     15.6146     29.3639   ***
                       4: NATIVE AMER - 3: ORIENTAL         -0.5050     18.0357     36.5765

                       1: WHITE       - 4: NATIVE AMER     -21.2542     -7.7055      5.8431
                       1: WHITE       - 5: OTHER           -11.8152      0.1357     12.0866
                       1: WHITE       - 2: BLACK             5.4736      7.9091     10.3445   ***
                       1: WHITE       - 3: ORIENTAL         -2.3445     10.3302     23.0049

                       5: OTHER       - 4: NATIVE AMER     -25.8949     -7.8413     10.2124
                       5: OTHER       - 1: WHITE           -12.0866     -0.1357     11.8152
                       5: OTHER       - 2: BLACK            -4.4045      7.7733     19.9512
                       5: OTHER       - 3: ORIENTAL         -7.2129     10.1944     27.6018

                       2: BLACK       - 4: NATIVE AMER     -29.3639    -15.6146     -1.8654   ***
                       2: BLACK       - 1: WHITE           -10.3445     -7.9091     -5.4736   ***
                       2: BLACK       - 5: OTHER           -19.9512     -7.7733      4.4045
                       2: BLACK       - 3: ORIENTAL        -10.4678      2.4211     15.3100

                       3: ORIENTAL    - 4: NATIVE AMER     -36.5765    -18.0357      0.5050
                       3: ORIENTAL    - 1: WHITE           -23.0049    -10.3302      2.3445
                       3: ORIENTAL    - 5: OTHER           -27.6018    -10.1944      7.2129
                       3: ORIENTAL    - 2: BLACK           -15.3100     -2.4211     10.4678


 
 
========================================================================================================================

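     Note that the table above indicates that two of the pairs of
races being compared do differ significantly in mean values of
cigarettes per day: Native American versus Black, and White versus
Black.  These are the comparisons marked '***', whose simultaneous
confidence intervals do not include 0.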
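
     The Bonferroni adjustment can be checked by hand.  The sketch
below computes the number of pairwise comparisons, the chance of at
least one false positive if the 10 tests were unadjusted and (as a
simplifying assumption) independent, the adjusted per-comparison
alpha, the critical t value, and the half-width of the White versus
Black interval.  The data set name BONF and its variable names are
made up for this illustration; the sample sizes, error df and MSE are
taken from the printouts above.

* ==================================================================== ;

    data bonf ;
         k      = 5 ;                         /* number of race groups  */
         ncomp  = k * (k - 1) / 2 ;           /* 10 pairwise comparisons */
         alpha  = 0.05 ;
         fwer   = 1 - (1 - alpha)**ncomp ;    /* unadjusted, independent */
         alphab = alpha / ncomp ;             /* Bonferroni alpha: 0.005 */
         tcrit  = tinv(1 - alphab/2, 5882) ;  /* two-sided critical t    */
         /* half-width for WHITE (n=5638) minus BLACK (n=225) */
         halfw  = tcrit * sqrt(162.7525 * (1/5638 + 1/225)) ;
    run ;

    proc print data = bonf ;
    run ;

* ==================================================================== ;

     FWER comes to about 0.40, far above 0.05.  TCRIT should agree
with the 'Critical Value of T' line in the printout, and the
difference 7.9091 plus or minus HALFW (about 2.44) should reproduce
the printed limits 5.4736 and 10.3445 up to rounding.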


* ==================================================================== ;        

PROBLEM 1.

     Refer to the data in Chapter 4 on crime rates.

     Create categorical variables which represent tertiles of the
variables EX1 and W.  That is, create two new variables EX1TERT
and WTERT.  For example, for EX1TERT, sort the observed values of
EX1 into low, middle and high groups (of size 16, 16, and 15),
and define EX1TERT = 1 if the observation is in the low group,
EX1TERT = 2 if the observation is in the middle group, etc.  Do the
same for WTERT.

     Find means and standard deviations for the crime rate R in each
of the tertiles of EX1 and W.  Find 95% confidence intervals for each
of these means.

     Now use PROC GLM to carry out analyses of variance of the
outcome variable R versus EX1TERT and WTERT (separate analyses).
State the conclusions from your analysis.

     Use the MEANS statement in PROC GLM with the BONFERRONI option
to determine whether the tertiles of EX1 and W have significantly
different values of R.  Again describe your conclusions.
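
     One possible way to set up the tertile variables is sketched
below.  It assumes the crime data are already in a SAS data set,
here called CRIME (a made-up name), containing R, EX1 and W.  PROC RANK
with GROUPS=3 numbers the groups 0, 1, 2, so 1 is added.  The group
sizes it produces are not necessarily the 16, 16, 15 split described
above, especially when there are ties, so check them with PROC MEANS
and adjust if needed.

* ==================================================================== ;

    proc rank data = crime out = crimet groups = 3 ;
         var ex1 w ;
         ranks ex1grp wgrp ;             /* group numbers 0, 1, 2 */
    run ;

    data crimet ;
         set crimet ;
         ex1tert = ex1grp + 1 ;          /* recode to 1, 2, 3 */
         wtert   = wgrp + 1 ;
    run ;

    /* CLM prints 95% confidence limits for each mean */
    proc means data = crimet n mean std clm ;
         class ex1tert ;
         var r ;
    run ;

    proc glm data = crimet ;
         class ex1tert ;
         model r = ex1tert / solution ;
         means ex1tert / bon ;
    run ;

* ==================================================================== ;

     The PROC MEANS and PROC GLM steps would be repeated with WTERT
in place of EX1TERT.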

* ==================================================================== ;        
n54703.007  Last update: March 6, 2006.