PROC GLM, I:  One-way Analysis of Variance.                  n54703.007

The 'GLM' in PROC GLM stands for General Linear Model.  PROC GLM can be
used for analysis of variance problems, but also for regression problems
and analysis of covariance.  The latter is simply a mixture of analysis
of variance and regression.

Analysis of variance is usually thought of in terms of *factors*, i.e.,
variables which can take on a small number of discrete values: gender is
such a variable, where perhaps gender = 0 indicates male, and gender = 1
indicates female.  Another such factor is race, which may be coded as:

     race = 1, African
     race = 2, European
     race = 3, Asian
     race = 4, Native American
     race = 5, Other

Suppose you wanted to study the relationship between race and cigarettes
per day in smokers.  It is actually possible to do most analysis of
variance problems using PROC REG, though it is somewhat cumbersome to do
so.  Here is an INCORRECT approach:

------------------------------------------------------------------------
proc reg data = racesmk ;
     model cigs = race ;
run ;
------------------------------------------------------------------------

In this analysis, race is entered as a *quantitative* predictor.  There
is an implied order: African < European < Asian, etc.  There is no reason
to assume such an order.  A better approach would be the following:

------------------------------------------------------------------------
data racesmk ;
     infile 'racesmk.dat' ;
     input person cigs race ;

*    Start every indicator at 0, then switch on the one that applies ;
     african  = 0 ;
     european = 0 ;
     asian    = 0 ;
     native   = 0 ;
     other    = 0 ;

     if race eq 1 then african  = 1 ;
     if race eq 2 then european = 1 ;
     if race eq 3 then asian    = 1 ;
     if race eq 4 then native   = 1 ;
     if race eq 5 then other    = 1 ;
run ;

* The indicator for OTHER is deliberately left out of the model ;
proc reg data = racesmk ;
     model cigs = african european asian native ;
run ;
------------------------------------------------------------------------

There are two important points to note about this regression.  First,
race is represented in the model by *indicator variables*: that is,
african = 1 indicates that the person's race is African.
Second, only four of the five indicator variables are entered into the
regression.  The fifth racial category corresponds essentially to the
intercept.  This regression will produce a coefficient for each of the
four races entered.  The coefficients are related to the means of the
dependent variable as explained below.

The following is a program and printout based on Lung Health Study data
for baseline cigarettes per day (F10CIGS) versus race, using both
PROC REG and PROC GLM:

* ==================================================================== ;
AWHITE   = 0 ;
ABLACK   = 0 ;
AORIENTL = 0 ;
ANATIVE  = 0 ;
AOTHER   = 0 ;
AREFUSES = 0 ;

IF RACE EQ 1 THEN AWHITE   = 1 ;
IF RACE EQ 2 THEN ABLACK   = 1 ;
IF RACE EQ 3 THEN AORIENTL = 1 ;
IF RACE EQ 4 THEN ANATIVE  = 1 ;
IF RACE EQ 5 THEN AOTHER   = 1 ;
IF RACE EQ 6 THEN AREFUSES = 1 ;
* ==================================================================== ;
PROC FORMAT ;
     VALUE RACE 1 = 'WHITE'
                2 = 'BLACK'
                3 = 'ORIENTAL'
                4 = 'NATIVE AMER'
                5 = 'OTHER'
                6 = 'REFUSES' ;
* ==================================================================== ;
proc means data = smoke n mean std stderr ;
     class race ;
     var f10cigs ;
title1 'PROC MEANS: mean values of f10cigs versus race' ;
format race race. ;

proc reg data = smoke ;
     where race ne . ;
     model f10cigs = awhite ablack aorientl anative ;
title1 'PROC REG: model F10cigs = black oriental native other' ;

proc glm data = smoke ;
     where race ne . ;
     class race ;
     model f10cigs = race / solution ;
     format race race. ;
title1 'PROC GLM: model F10cigs = race' ;
endsas ;
* ==================================================================== ;

PROC MEANS: mean values of f10cigs versus race                         1
                                             18:41 Monday, March 6, 2006

Analysis Variable : F10CIGS CIGS PER DAY AT SCREEN 1

RACE              N Obs      N          Mean       Std Dev     Std Error
---------------------------------------------------------------------
1: WHITE           5638   5638    31.5801703    12.8111185     0.1706179
2: BLACK            225    225    23.6711111    10.4841850     0.6989457
3: ORIENTAL           8      8    21.2500000    12.1037184     4.2793107
4: NATIVE AMER        7      7    39.2857143    29.2159448    11.0425892
5: OTHER              9      9    31.4444444    13.0873136     4.3624379
---------------------------------------------------------------------
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 06MAR06 18:41


PROC REG: model F10cigs = black oriental native other                  2
                                             18:41 Monday, March 6, 2006

Model: MODEL1
Dependent Variable: F10CIGS    CIGS PER DAY AT SCREEN 1

                         Analysis of Variance

                            Sum of          Mean
Source          DF         Squares        Square     F Value    Prob>F

Model            4     14787.79995    3696.94999      22.715    0.0001
Error         5882    957310.07605     162.75248
C Total       5886    972097.87600

     Root MSE      12.75745     R-square     0.0152
     Dep Mean      31.27280     Adj R-sq     0.0145
     C.V.          40.79406

                         Parameter Estimates

                Parameter      Standard    T for H0:                Variable
Variable  DF     Estimate         Error  Parameter=0  Prob > |T|    Label

INTERCEP   1    31.444444    4.25248265        7.394      0.0001    Intercept
AWHITE     1     0.135726    4.25587544        0.032      0.9746
ABLACK     1    -7.773333    4.33669840       -1.792      0.0731
AORIENTL   1   -10.194444    6.19900544       -1.645      0.1001
ANATIVE    1     7.841270    6.42914945        1.220      0.2226


PROC GLM: model F10cigs = race                                         3
                                             18:41 Monday, March 6, 2006

General Linear Models Procedure
Class Level Information

Class    Levels    Values
RACE          5    1: WHITE  2: BLACK  3: ORIENTAL  4: NATIVE AMER  5: OTHER

Number of observations in data set = 5887

General Linear Models Procedure

Dependent Variable: F10CIGS    CIGS PER DAY AT SCREEN 1

Source                  DF      Sum of Squares        Mean Square   F Value     Pr > F

Model                    4      14787.79994555      3696.94998639     22.72     0.0001
Error                 5882     957310.07605241       162.75247808
Corrected Total       5886     972097.87599796

                  R-Square            C.V.        Root MSE     F10CIGS Mean
                  0.015212        40.79406     12.75744795      31.27280448

Source                  DF           Type I SS        Mean Square   F Value     Pr > F
RACE                     4      14787.79994555      3696.94998639     22.72     0.0001

Source                  DF         Type III SS        Mean Square   F Value     Pr > F
RACE                     4      14787.79994555      3696.94998639     22.72     0.0001

                                           T for H0:    Pr > |T|    Std Error of
Parameter                    Estimate    Parameter=0                  Estimate

INTERCEPT                 31.44444444 B         7.39      0.0001      4.25248265
RACE      1: WHITE         0.13572583 B         0.03      0.9746      4.25587544
          2: BLACK        -7.77333333 B        -1.79      0.0731      4.33669840
          3: ORIENTAL    -10.19444444 B        -1.64      0.1001      6.19900544
          4: NATIVE AMER   7.84126984 B         1.22      0.2226      6.42914945
          5: OTHER         0.00000000 B          .         .           .

NOTE: The X'X matrix has been found to be singular and a generalized
      inverse was used to solve the normal equations.  Estimates followed
      by the letter 'B' are biased, and are not unique estimators of the
      parameters.
========================================================================================================================

The PROC REG and PROC GLM analyses above are one-way analyses of
variance.  Note that a key difference is that in PROC GLM, the variable
'race' is entered as a CLASS variable.  What that means is that SAS will
create individual indicator variables for each level of 'race'.  There
are 5 levels (no one answered 'refused').  Thus PROC GLM does the same
thing as PROC REG, but it saves you some work: it creates the indicator
variables automatically.

Let's compare the printout from PROC MEANS, PROC REG, and PROC GLM.
Note the following for F10CIGS:

     PROC MEANS:  race = WHITE    mean F10CIGS = 31.58
                  race = BLACK    mean F10CIGS = 23.67

     PROC REG  :  Intercept       F10CIGS = 31.4444
                  WHITE coeff     F10CIGS =  0.1357
                  BLACK coeff     F10CIGS = -7.7733

     PROC GLM  :  Intercept       F10CIGS = 31.4444
                  WHITE coeff     F10CIGS =  0.1357
                  BLACK coeff     F10CIGS = -7.7733

You can tell what is going on here.  In PROC REG, the sum of the
intercept and the coefficient equals the mean for the race: for WHITE,
31.4444 + 0.1357 = 31.5801, and for BLACK, 31.4444 - 7.7733 = 23.6711.
The intercept by itself, 31.4444, is the mean for OTHER, the category
whose indicator was left out of the model.  PROC REG prints a coefficient
for each of the indicator variables that was entered.

PROC GLM reaches the same numbers by a slightly different route.  First,
it orders the levels of the CLASS variable by their value-labels.
Second, it sets the coefficient for the last level in that order to zero,
so that in effect only 4 of the 5 indicator variables enter the analysis
and the last level becomes the default category.  In the printout above
the value-labels begin with the numeric code ('1: WHITE', ...,
'5: OTHER'), so OTHER comes last and is the default category.  With the
unnumbered labels shown in the PROC FORMAT step of the program ('WHITE',
'BLACK', ...), the order would be alphabetical, WHITE would come last on
the list, and WHITE would be the default category.  Thus in PROC GLM,
just as in PROC REG, you can find the mean for a category by adding the
coefficient for the category to the intercept term.

Now: you may want the default category to be different from the one
which PROC GLM chooses automatically.  You can accomplish that by using
a numeric code for the category such that the highest value corresponds
to what you want as the default category (this works when the levels are
sorted by their numeric codes rather than by value-labels).
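As a minimal sketch of the recoding idea, the step below gives WHITE the
highest numeric code so that it becomes the default category.  The names
smoke2 and racegrp and the code 9 are hypothetical choices made for this
illustration; they are not part of the original program.  The
ORDER=INTERNAL option on the PROC GLM statement asks for the levels to be
sorted by their stored numeric values rather than by value-labels.

------------------------------------------------------------------------
* Sketch only: smoke2, racegrp and the code 9 are illustrative names ;
data smoke2 ;
     set smoke ;
     racegrp = race ;
     if race eq 1 then racegrp = 9 ;   * WHITE now has the highest code ;
run ;

proc glm data = smoke2 order = internal ;
     where race ne . ;
     class racegrp ;
     model f10cigs = racegrp / solution ;
title1 'PROC GLM: F10cigs versus recoded race, WHITE as default' ;
run ;
quit ;
------------------------------------------------------------------------

With this coding the intercept should equal the WHITE mean, and each
remaining coefficient should be the difference between that category's
mean and the WHITE mean.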
Alternatively, in the FORMAT section you can ensure that the categories
sort in the order you want by prefixing each label with its code, as
follows:

* ==================================================================== ;
PROC FORMAT ;
     VALUE RACE 1 = '1-WHITE'
                2 = '2-BLACK'
                3 = '3-ORIENTAL'
                4 = '4-NATIVE AMER'
                5 = '5-OTHER'
                6 = '6-REFUSES' ;
* ==================================================================== ;

One of the purposes of analysis of variance is to see whether means for
the different categories differ significantly.  PROC GLM includes a test
for whether there are overall differences in the means, in the ANOVA
table.  Look at the printout and answer the following:

     Do the categories differ significantly on F10CIGS ?   p = ?

Note that PROC REG and PROC GLM each have advantages.  PROC REG always
prints the coefficients and their standard errors; its overall F-test
(F = 22.715 above) is a test of whether the groups differ only because
the race indicators are the only predictors in the model.  PROC GLM
produces an F-test for the CLASS variable itself, but it prints the
coefficients and their standard errors only when you request them with
the SOLUTION option, as was done above.  PROC GLM can also carry out
multiple-comparisons tests.

As noted above, there is a difference between the races in the mean
number of cigs per day.  This is shown by the F-test.  However, all this
test tells you is that you can reject the hypothesis that all the means
are the same.  It does not tell you which means are different.  Since
there are 5 race-groups, there are 10 possible different pairs of races
which might be compared.  Your chance of seeing a significant difference
between two of the groups, given that there are 10 comparisons, is
considerably higher than 0.05 unless you make some kind of adjustment
for the fact that you are doing 10 comparisons.  One way of making such
an adjustment is to use the Bonferroni procedure, which tests each of
the 10 comparisons at the level 0.05/10 = 0.005.
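The arithmetic behind the Bonferroni adjustment can be sketched in a
short DATA step.  This is an illustration, not part of the original
program; the variable names are hypothetical, and the figure for the
chance of at least one false positive assumes independent tests, which
pairwise comparisons are not, so it is only a rough guide.

------------------------------------------------------------------------
* Sketch only: Bonferroni arithmetic for 5 groups, all pairs compared ;
data _null_ ;
     k      = 5 ;                      * number of race groups ;
     npairs = k * (k - 1) / 2 ;        * number of pairwise comparisons ;
     alpha  = 0.05 ;                   * experimentwise error rate ;
     alpha1 = alpha / npairs ;         * level used for each comparison ;

*    Rough chance of at least one false positive if the comparisons ;
*    were independent and each were tested at the unadjusted level ;
     anyfp  = 1 - (1 - alpha)**npairs ;

*    Two-sided critical value of t on the error df of the ANOVA table ;
     tcrit  = tinv(1 - alpha1/2, 5882) ;

     put npairs= alpha1= anyfp= tcrit= ;
run ;
------------------------------------------------------------------------

Here npairs is 10 and alpha1 is 0.005, and anyfp comes to about 0.40,
far above 0.05.  The value of tcrit can be compared with the 'Critical
Value of T' that PROC GLM reports in its Bonferroni printout.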
The Bonferroni adjustment can be requested in PROC GLM as follows:

========================================================================================================================
proc glm data = smoke ;
     class race ;
     model f10cigs = race / solution ;
     means race / bon ;
title1 'PROC GLM: Baseline cigs/day versus race ...' ;
title2 'Bonferroni Multiple Comparisons Test included' ;
format race race. ;
run ;
========================================================================================================================

PROC GLM: Baseline cigs versus race ...                                5
Bonferroni Multiple Comparisons Test included.
                                             18:59 Monday, March 6, 2006

General Linear Models Procedure
Class Level Information

Class    Levels    Values
RACE          5    1: WHITE  2: BLACK  3: ORIENTAL  4: NATIVE AMER  5: OTHER

Number of observations in data set = 5887

LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 06MAR06 18:59

General Linear Models Procedure

Dependent Variable: F10CIGS    CIGS PER DAY AT SCREEN 1

Source                  DF      Sum of Squares        Mean Square   F Value     Pr > F

Model                    4      14787.79994555      3696.94998639     22.72     0.0001
Error                 5882     957310.07605241       162.75247808
Corrected Total       5886     972097.87599796

                  R-Square            C.V.        Root MSE     F10CIGS Mean
                  0.015212        40.79406     12.75744795      31.27280448

Source                  DF           Type I SS        Mean Square   F Value     Pr > F
RACE                     4      14787.79994555      3696.94998639     22.72     0.0001

Source                  DF         Type III SS        Mean Square   F Value     Pr > F
RACE                     4      14787.79994555      3696.94998639     22.72     0.0001

                                           T for H0:    Pr > |T|    Std Error of
Parameter                    Estimate    Parameter=0                  Estimate

INTERCEPT                 31.44444444 B         7.39      0.0001      4.25248265
RACE      1: WHITE         0.13572583 B         0.03      0.9746      4.25587544
          2: BLACK        -7.77333333 B        -1.79      0.0731      4.33669840
          3: ORIENTAL    -10.19444444 B        -1.64      0.1001      6.19900544
          4: NATIVE AMER   7.84126984 B         1.22      0.2226      6.42914945
          5: OTHER         0.00000000 B          .         .           .

NOTE: The X'X matrix has been found to be singular and a generalized
      inverse was used to solve the normal equations.  Estimates followed
      by the letter 'B' are biased, and are not unique estimators of the
      parameters.

General Linear Models Procedure

Bonferroni (Dunn) T tests for variable: F10CIGS

NOTE: This test controls the type I experimentwise error rate but
      generally has a higher type II error rate than Tukey's for all
      pairwise comparisons.

Alpha= 0.05  Confidence= 0.95  df= 5882  MSE= 162.7525
Critical Value of T= 2.80809

Comparisons significant at the 0.05 level are indicated by '***'.

                                 Simultaneous                Simultaneous
                                    Lower      Difference       Upper
            RACE                  Confidence     Between      Confidence
         Comparison                 Limit         Means         Limit

4: NATIVE AMER - 1: WHITE           -5.8431       7.7055       21.2542
4: NATIVE AMER - 5: OTHER          -10.2124       7.8413       25.8949
4: NATIVE AMER - 2: BLACK            1.8654      15.6146       29.3639   ***
4: NATIVE AMER - 3: ORIENTAL        -0.5050      18.0357       36.5765

1: WHITE       - 4: NATIVE AMER    -21.2542      -7.7055        5.8431
1: WHITE       - 5: OTHER          -11.8152       0.1357       12.0866
1: WHITE       - 2: BLACK            5.4736       7.9091       10.3445   ***
1: WHITE       - 3: ORIENTAL        -2.3445      10.3302       23.0049

5: OTHER       - 4: NATIVE AMER    -25.8949      -7.8413       10.2124
5: OTHER       - 1: WHITE          -12.0866      -0.1357       11.8152
5: OTHER       - 2: BLACK           -4.4045       7.7733       19.9512
5: OTHER       - 3: ORIENTAL        -7.2129      10.1944       27.6018

2: BLACK       - 4: NATIVE AMER    -29.3639     -15.6146       -1.8654   ***
2: BLACK       - 1: WHITE          -10.3445      -7.9091       -5.4736   ***
2: BLACK       - 5: OTHER          -19.9512      -7.7733        4.4045
2: BLACK       - 3: ORIENTAL       -10.4678       2.4211       15.3100

3: ORIENTAL    - 4: NATIVE AMER    -36.5765     -18.0357        0.5050
3: ORIENTAL    - 1: WHITE          -23.0049     -10.3302        2.3445
3: ORIENTAL    - 5: OTHER          -27.6018     -10.1944        7.2129
3: ORIENTAL    - 2: BLACK          -15.3100      -2.4211       10.4678

========================================================================================================================

Note that the table above indicates that some of the pairs of races
being compared do differ significantly in mean values of cigarettes per
day: Native American versus Black and White versus Black.  These are the
only pairs marked '***'; for every other pair the simultaneous
confidence interval for the difference includes zero.

* ==================================================================== ;
PROBLEM 1.

Refer to the data in Chapter 4 on crime rates.  Create categorical
variables which represent tertiles of the variables EX1 and W.  That is,
create two new variables EX1TERT and WTERT.  For example, for EX1TERT,
sort the observed values of EX1 into low, middle and high groups (of
size 16, 16, and 15), and define EX1TERT = 1 if the observation is in
the low group, EX1TERT = 2 if the observation is in the middle group,
etc.  Do the same for WTERT.

Find means and standard deviations for the crime rate R in each of the
tertiles of EX1 and W.  Find 95% confidence intervals for each of these
means.

Now use PROC GLM to carry out analyses of variance of the outcome
variable R versus EX1TERT and WTERT (separate analyses).  State the
conclusions from your analysis.

Use the MEANS statement in PROC GLM with the BON (Bonferroni) option to
determine whether the tertiles of EX1 and W have significantly different
mean values of R.  Again describe your conclusions.
* ==================================================================== ;

n54703.007  Last update: March 6, 2006.