PROC REG, I : Linear Regression and related topics. n54703.004 Proc reg is one of the oldest and most used of the SAS procedures, and it has an enormous number of options. There is a lot of overlap between proc reg and proc glm. In general, proc glm does more, but there are some things that proc reg will do that proc glm will not. The object in regression is to investigate the relationship between a quantitative outcome variable Y and one or more predictors. The simplest case is the one-dimensional model [1] Y = a + b*X + e, where Y is the outcome variable, X is the 'predictor', and 'a' and 'b' are unknown coefficients. The term "e" represents "measurement error". Typically one assumes that e has a normal distribution N(0, sigma^2), where the mean error is 0, and sigma represents the standard deviation of measurement. Typically sigma is an unknown positive number. If you fixed X at a certain value and measured Y over and over again, you would expect the standard deviation of the resulting Y's to be close to sigma. Note that sigma^2 is the *variance* of Y. Proc reg and most other regression procedures produce estimates of a, b, and sigma, and a number of other related statistics. In the case of model [1] above, what regression does essentially is find the straight line which best fits the data. Straight lines for a 1-dimensional model like this are determined by their intercept (a) and their slope (b). They also produce an estimate of sigma, the standard deviation of the measurement error. This estimate is usually called "s". You need at least two points of observed data to estimate a and b. To obtain a valid estimate of sigma, there must be at least three observed data points. In general the number of data points will be considerably higher than 3. Proc reg prints coefficient estimates, their standard errors, an "analysis of variance" (or "anova") table, and other statistics. Here is an example, with the lines numbered for reference later: ================================================================================== 1 options linesize = 80 ; 2 footnote "~john-c/5421/n54703.004.sas &sysdate &systime" ; 3 4 data class ; 5 input name $ height weight age @@ ; 6 datalines ; 7 Alfred 69.0 112.5 14 Alice 56.5 84.0 13 Barbara 65.3 98.0 13 8 Carol 62.8 102.5 14 Henry 63.5 102.5 14 James 57.3 83.0 12 9 Jane 59.8 84.5 12 Janet 62.5 112.5 15 Jeffrey 62.5 84.0 13 10 John 59.0 99.5 12 Joyce 51.3 50.5 11 Judy 64.3 90.0 14 11 Louise 56.3 77.0 12 Mary 66.5 112.0 15 Philip 72.0 150.0 16 12 Robert 64.8 128.0 12 Ronald 67.0 133.0 15 Thomas 57.5 85.0 11 13 William 66.5 112.0 15 14 ; 15 16 run ; 17 18 options pagesize = 35 ; 19 20 proc plot data = class ; 21 plot weight*height ; 22 title1 "Plot of weight vs height in 19 schoolchildren" ; 23 run ; 24 25 proc reg data = class ; 26 model weight = height ; 27 output out = regout residual = wresid rstudent = wstudent 28 dffits = wdffits predicted = wpred ; 29 title1 "Simple linear regression: Weight vs. Height in schoolchildren" ; 30 run ; 31 32 proc plot data = regout ; 33 plot wresid * height ; 34 title1 "Plot of regression residuals vs. height" ; 35 run ; 36 37 proc print data = regout ; 38 var age height weight wpred wresid wstudent wdffits ; 39 title1 40 "Printout of diagnostics: predicted, residuals, student resids, dffits" ; 41 run ; 42 43----------------------------------------------------------------------------- 44 45 Plot of weight vs height in 19 schoolchildren 1 46 17:46 Friday, January 30, 2004 47 48 Plot of WEIGHT*HEIGHT. 
Legend: A = 1 obs, B = 2 obs, etc. 49 50 WEIGHT | 51 150 + A 52 | 53 | 54 | A 55 | A 56 125 + 57 | 58 | A A 59 | B 60 | A A 61 100 + A A 62 | 63 | A 64 | A AA A A 65 | 66 75 + A 67 | 68 | 69 | 70 | 71 50 + A 72 -+-------------+-------------+-------------+-------------+-------------+- 73 50 55 60 65 70 75 74 75 HEIGHT 76 77 78 79 ~john-c/5421/n54703.004.sas 30JAN04 17:46 80 Simple linear regression: Weight vs. Height in schoolchildren 2 81 17:46 Friday, January 30, 2004 82 83 Model: MODEL1 84 Dependent Variable: WEIGHT 85 86 Analysis of Variance 87 88 Sum of Mean 89 Source DF Squares Square F Value Prob>F 90 91 Model 1 7193.24912 7193.24912 57.076 0.0001 92 Error 17 2142.48772 126.02869 93 C Total 18 9335.73684 94 95 Root MSE 11.22625 R-square 0.7705 96 Dep Mean 100.02632 Adj R-sq 0.7570 97 C.V. 11.22330 98 99 Parameter Estimates 100 101 Parameter Standard T for H0: 102 Variable DF Estimate Error Parameter=0 Prob > |T| 103 104 INTERCEP 1 -143.026918 32.27459130 -4.432 0.0004 105 HEIGHT 1 3.899030 0.51609395 7.555 0.0001 106 107 108 109 ~john-c/5421/n54703.004.sas 30JAN04 17:46 110 Plot of regression residuals vs. height 3 111 17:46 Friday, January 30, 2004 112 113 Plot of WRESID*HEIGHT. Legend: A = 1 obs, B = 2 obs, etc. 114 115 | 116 20 + 117 | A 118 | 119 | A 120 | A A A 121 10 + 122 R | 123 e | A 124 s | A 125 i | A 126 d 0 + A A 127 u | A 128 a | B 129 l | A A 130 | 131 -10 + 132 | 133 | A A 134 | A 135 | A 136 -20 + 137 --+-------------+-------------+-------------+-------------+-------------+- 138 50 55 60 65 70 75 139 140 HEIGHT 141 142 143 144 ~john-c/5421/n54703.004.sas 30JAN04 17:46 145 Printout of diagnostics: predicted, residuals, student resids, dffits 4 146 17:46 Friday, January 30, 2004 147 148 OBS AGE HEIGHT WEIGHT WPRED WRESID WSTUDENT WDFFITS 149 150 1 14 69.0 112.5 126.006 -13.5062 -1.33150 -0.55156 151 2 13 56.5 84.0 77.268 6.7317 0.62942 0.23750 152 3 13 65.3 98.0 111.580 -13.5798 -1.27834 -0.35390 153 4 14 62.8 102.5 101.832 0.6678 0.05931 0.01404 154 5 14 63.5 102.5 104.562 -2.0615 -0.18350 -0.04448 155 6 12 57.3 83.0 80.388 2.6125 0.23923 0.08249 156 7 12 59.8 84.5 90.135 -5.6351 -0.50799 -0.13529 157 8 15 62.5 112.5 100.662 11.8375 1.08931 0.25690 158 9 13 62.5 84.0 100.662 -16.6625 -1.59234 -0.37553 159 10 12 59.0 99.5 87.016 12.4841 1.16942 0.33577 160 11 11 51.3 50.5 56.993 -6.4933 -0.68541 -0.45950 161 12 14 64.3 90.0 107.681 -17.6807 -1.71545 -0.43638 162 13 12 56.3 77.0 76.488 0.5115 0.04739 0.01829 163 14 15 66.5 112.0 116.259 -4.2586 -0.38743 -0.12129 164 15 16 72.0 150.0 137.703 12.2967 1.28918 0.74426 165 16 12 64.8 128.0 109.630 18.3698 1.80087 0.47660 166 17 15 67.0 133.0 118.208 14.7919 1.42979 0.47285 167 18 11 57.5 85.0 81.167 3.8327 0.35087 0.11830 168 19 15 66.5 112.0 116.259 -4.2586 -0.38743 -0.12129 169 170 171 ~john-c/5421/n54703.004.sas 30JAN04 17:46 =============================================================================== Notes on the program: Line 5: note that the input statement is designed to read in multiple records per line of the 'datalines'. Again, note the '@@' at the end of line 5, indicating that data on the current continues to be read until the end of the line is encountered. Line 20: As always, it is a good idea to plot the data BEFORE you carry out an analysis. This may tell you that a simple linear model is not appropriate. Lines 25-28: In this case proc reg is used to study the relationship of weight to height in schoolchildren aged 11 to 16. 
The model statement specifies the model

      weight = a + b*height + e,

where e represents variability in weight that is not accounted for by height, or "measurement error".  Measurement error is not really the right description here.  Most of the deviation away from the linear trend is due to interperson variability rather than to measurement error: to put it crudely, some kids are thin and others are fat.  Height is an important predictor of weight, but clearly there is variability in body shape and other factors that affect weight also.

Lines 27-28:  The output statement is used to obtain some regression diagnostics which will be printed later.  The output statement creates another SAS dataset, in this case called "regout", which includes:

1. The original variables, height and weight.

2. Residuals, wresid: the differences between the observed weights and the weights predicted by the regression equation.

3. Studentized residuals, wstudent.  These are used to judge whether some observations have an unusually large effect on the regression parameters.  If the studentized residual for a given observation is greater than 2 in absolute value, that observation should be checked to be sure that the data are correct.  In some cases it may be advisable to re-run the regression with such observations deleted.

4. The dffits statistic, wdffits: this is another method for determining the influence of a single observation.  It is computed by doing the regression with that observation omitted, and seeing how much difference that makes in the parameter estimates.  Again, if dffits is greater than 2 in absolute value, it suggests that this is an unduly influential observation.

5. Predicted values, wpred: in this case, the predicted value of weight given height based on the coefficients; that is, pred weight = a# + b#*height, where a# and b# are the regression estimates of the intercept and slope respectively.  Note that the residual at a given value of height is:  resid = observed weight - pred weight.

Line 32:  proc plot data = regout: this plot is based on the output dataset from proc reg.

Line 33:  plot of wresid versus height.  The residual plot can be very useful for detecting systematic deviations away from the hypothesized model, or for detecting dependence between the size of the residuals and the values of predictors.  Seeing patterns in the residuals may cause you to add terms to the model or to transform the "Y" variable.  The residuals always sum to zero.  Residuals are sometimes plotted against the predicting variable (height, in this case), which is useful for determining whether other terms need to be added to the model, or against the predicted value, which may be useful in determining whether a transformation of the outcome variable is needed.

------------------------------------------------------------------------

Notes on Printout:

Lines 43-79:  Proc plot to show the relationship of weight to height: as always, it is a good idea to graph the data and use descriptive statistics before you carry out parameter estimation and testing.  In this case, one sees an obvious increasing trend in weight with height.  It appears to be approximately linear and there are no obvious outliers or highly influential points.  The linear model which will be used in proc reg,

      model weight = height   <--->   weight = a + b*height + e,

seems reasonable.  The dataset is small (19 observations).
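To make concrete what the output statement produces, here is a minimal sketch that recomputes the predicted values and residuals "by hand" from the fitted coefficients in the printout above.  The dataset name 'check' and the variables 'mypred' and 'myresid' are chosen for illustration only.

      data check ;
         set regout ;
         * Recompute prediction and residual from the fitted coefficients ;
         * (intercept and slope copied from the Parameter Estimates table) ;
         mypred  = -143.026918 + 3.899030*height ;
         myresid = weight - mypred ;   * residual = observed - predicted ;
      run ;

      proc print data = check ;
         var height weight wpred mypred wresid myresid ;
         title1 "Checking wpred and wresid against hand-computed values" ;
      run ;

Up to rounding of the coefficients, mypred and myresid should agree with the wpred and wresid variables that proc reg wrote to the dataset regout.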
Lines 86-93:  The 'Analysis of Variance' table for a least-squares regression like this one is based on three sums of squares:

1. Model sum of squares (also called Regression SS):
      Sum of (pred y - overall mean y)^2 across all the observations.

2. Error sum of squares (also called Residual SS):
      Sum of (observed y - pred y)^2 across all the observations.

3. 'Corrected' total sum of squares ("C Total" SS):
      Sum of (observed y - overall mean y)^2.

The first two of these add up to the third:

      Regr SS + Error SS = C Total SS.

If Regr SS is large and Error SS is small, then the regression is accounting for most of the variability.

4. The Degrees of Freedom (DF) for these sums of squares are:

      Regr SS DF    = (no. of params - 1)          =  2 - 1  =  1.
      C Total SS DF = (no. of obs - 1)             = 19 - 1  = 18.
      Error SS DF   = (no. of obs - no. of params) = 19 - 2  = 17.

5. The mean squares are the sums of squares divided by their associated degrees of freedom: for example,

      Error Mean Square = 2142.48772 / 17 = 126.02869, etc.

6. The F-statistic is the quotient of the Regr MS by the Error MS:

           Regr SS / (p - 1)      7193.24912
      F =  ------------------  =  ----------  =  57.076.
           Error SS / (n - p)     126.02869

7. The p-value is the probability of seeing an F-value as large as, or larger than, the observed, if the null hypothesis is true.  The null hypothesis being tested here is that there is no real relationship between weight and height.  That translates into the statement that the regression coefficient b (i.e., slope) is zero:  H0: b = 0.  Here the p-value is < 0.0001.  One would reject the null hypothesis and conclude that weight is related to height.

Line 95:  Root MSE is the square root of the Error Mean Square: that is, in this case, Root MSE = sqrt(126.02869) = 11.22625.  This is the regression estimate of the standard deviation sigma of the "measurement error".  This estimate is often called "s".  Another way to write it is:  sqrt[(Error SS)/(n - p)].

Line 95:  R-square is the quotient (Regr SS) / (C Total SS): here, R-square = 7193.24912 / 9335.73684 = 0.7705.  Note that R-square is always less than or equal to 1.  R-square will equal 1 only when the Error SS is zero, which means that all the observations lie exactly on a straight line.  In general, it is often said that "R-square is the proportion of the variability in the outcome variable that is accounted for by the regression."  In this case, about 77%.

Lines 104-105:  Here are shown the parameter estimates, a# = intercept and b# = slope, and other statistics.  In this case the prediction equation for weight as a function of height is:

      weight = -143.027 + 3.899*height,

which means that if height increases by 1 inch, you would expect weight to increase by about 3.899 pounds.  The Standard Errors of the coefficients are also given.  These standard errors will be large if there is a lot of variability around the regression line (large Error SS) and small if the observed data points are very close to the regression line (small Error SS).

T for H0: Parameter=0:  this t-statistic is computed as the parameter estimate divided by its standard error.  It is compared to a t-distribution with 17 degrees of freedom (the Error DF:  n - p = 19 - 2 = 17).  If the t-statistic is large in absolute value, one rejects the null hypothesis.  The null hypothesis is that the true parameter is zero.

Prob > |T|:  this is the probability that you would observe a t-statistic with absolute value as large as, or larger than, the observed value, if in fact the null hypothesis were true.
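As a check on the arithmetic, the sketch below recomputes the slope t-statistic, its two-sided p-value, and its square (which matches the ANOVA F) from the estimate and standard error printed above.  The dataset name 'tcheck' is arbitrary.

      data tcheck ;
         * Slope estimate and standard error copied from the printout above ;
         b   = 3.899030 ;
         seb = 0.51609395 ;
         df  = 17 ;                           * error degrees of freedom, n - p ;
         t   = b / seb ;                      * about 7.555 ;
         p   = 2*(1 - probt(abs(t), df)) ;    * two-sided p-value ;
         f   = t**2 ;                         * matches the ANOVA F, 57.076, up to rounding ;
      run ;

      proc print data = tcheck ;
         title1 "Hand computation of the t-test for the slope" ;
      run ;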
In this case one rejects both of the null hypotheses:

      Intercept 'a' is zero:  p-value = 0.0004.
      Slope 'b' is zero:      p-value < 0.0001.

Notice that the latter test is identical to the F-test in the ANOVA table.  It is not a coincidence that T(slope) = 7.555 = sqrt(F-statistic) = sqrt(57.076).

Lines 113-144:  Plot of residuals versus predictor (height).  Note that residual = obsd y - pred y.  It is a fact that the mean of the residuals is always zero.  Therefore, in residual plots, one expects them to be about equally scattered above and below the horizontal line through 0 on the vertical axis.  In this case there is no obvious pattern or outliers among the residuals.  An outlier would represent an observation which is far off of the regression line.  There is, however, a formal definition of outlier, which will be discussed below.

Lines 148-168:  Printout of observations, predicted values, and statistics related to influence diagnostics.  Note that all of these are in the output dataset 'regout' produced by proc reg in lines 27-28.  Note here that the predicted values for weight (WPRED) are as given above,

      weight = -143.027 + 3.899*height.

The residuals are the differences between the observed weights and the predicted weights.  The variable 'WSTUDENT' is the 'Studentized residual'.  This provides a kind of normalized measure of how far the observed weight differs from the predicted weight.  As noted above, if this is > 2 in absolute value, it indicates that the point is a possible outlier or (perhaps unduly) influential point.  In this case the largest absolute value of WSTUDENT is 1.80087.  The variable WDFFITS is another measure of influence, and again, one expects it to be less than 2 in absolute value.  The largest absolute value of WDFFITS here is 0.74426.

As noted above, if WSTUDENT or WDFFITS had included some values that were large in absolute value, there are two actions you would take: (1) check the data to be sure there are no errors, and (2) consider re-running the regression with the influential points excluded, and see what kind of difference it makes in the F-statistic, the R-square, and the parameter estimates.  You can perform this second regression using the dataset 'regout' as follows:

      proc reg data = regout ;
        model weight = height ;
        where abs(wstudent) le 2.00 and abs(wdffits) le 2.00 ;
        title1 'Regression of weight on height, excluding influential points' ;
      run ;

In this case the regression would not change, because every observation satisfies the conditions in the "where" statement, so no points are excluded.

------------------------------------------------------------------------

Data Transformations:

One may want to transform either the outcome variable (Y) or the predictor (X).  In general the reasons for performing transformations are the following:

1. To change what appears to be a curvilinear relationship into a linear one.

2. To attain "homoscedasticity".  "Homoscedasticity" means that the measurement error has the same variability regardless of the values of the outcome variable and the predictor.  It often happens in medical or biological studies that the variability of a measured outcome is proportional to the size of that outcome.  An example is FEV1, a measure of lung function, defined as the volume of air in liters that a person can blow out in 1 second when making a maximal effort.  A large healthy person might have an FEV1 of 5.00 liters, while a small, sickly or aged person's FEV1 might be 1.5 liters.  The measurement of FEV1 - unlike, say, weight or height - has a lot of random variability.
That is, two consecutive measurements of FEV1 for the same person may easily differ by as much as 10%.  For the large person with FEV1 = 5.00, this would amount to saying sigma = .1*5.00 = 0.50 liters.  For the small person with FEV1 = 1.50, this yields sigma = .1*1.50 = 0.15 liters.

ANOVA tests and t-tests like those carried out in proc reg are based on the assumption of homoscedasticity.  If homoscedasticity is badly violated, these tests may give misleading results.  Frequently investigators try to eliminate heteroscedasticity by transforming the outcome variable with a log transformation.  One might replace FEV1 by log(FEV1), and perform the regression with this outcome variable rather than the original.

A way to detect heteroscedasticity is to plot the residuals for the outcome variable against the predicted values.  See below for examples.  Another way is to use options in proc reg called SPEC or ACOV on the model statement.  Again see below for an example.

========================================================================

Transformations

The following graph illustrates the problem of heteroscedasticity:

      http://www.biostat.umn.edu/~john-c/5421/n54703.004.2.grf1

Here the variability in the dependent variable clearly increases as the dependent variable itself increases.  The plot referenced above shows an approximately linear relationship between the dependent variable Y and the independent variable X.  The smooth curve superimposed on the data was generated by a SAS smoothing procedure called proc loess.

The next graph, at

      http://www.biostat.umn.edu/~john-c/5421/n54703.004.2.grf2

shows the same data, but the dependent variable Y has been log-transformed and replaced by log(Y).  The X-variable is unchanged.  As you can see, the variability is somewhat "tamed down".  This, however, has come at a price.  The formerly linear relationship between Y and X does not look so linear when Y is replaced by log(Y).  Further, some of the observations are far below the smoothed curve that was fit to the data.  Transformations can help in some cases, but in this case the log-transformation may have done more harm than good.

========================================================================

Outliers:

It is easy to confuse outliers and influential points because it will often happen that they are the same points.  An influential point has the following characteristic: if the regression is done with the influential point omitted, there is a resulting large change in the regression estimates of the parameters.  Thus in most cases an influential point will not be close to the regression line determined by all the other points.  When the influential point is included in the regression, that regression line changes substantially.  This is what the influence diagnostic DFFITS measures.

An outlier also will not be close to the regression line determined by the other points, but if it is included, it need not have much effect on the regression line.  A point can be both an influential point and an outlier.  The effect of including outliers in a regression is to inflate the estimate of Root MSE, or "s".  This in turn increases the standard errors of the parameter estimates, decreases the F-statistic and the R-square, and decreases the power of the regression procedure to detect relationships of the outcome variable to the independent variables.

Outliers in regression are defined as follows.  Let S be the set of residuals from a regression.  Suppose you constructed a boxplot for the set S.
The distance between the 25th percentile and the 75th percentile in the boxplot is called the *interquartile range*, or IQR.  A value which is more than 1.5*IQR above the median is called an outlier.  Similarly, a value which is more than 1.5*IQR below the median is called an outlier.  This leads to the following procedure to detect outliers in regression:

1. Regress Y on the independent variable(s).
2. Output the residuals to a dataset.
3. Find the 25th percentile, the median, and the 75th percentile (using proc univariate) of the residuals.
4. Find which points correspond to residuals which are either:
      (i)  above median + 1.5*IQR, or
      (ii) below median - 1.5*IQR.

Below is a program which shows how this works for data from the Lung Health Study.  The participants were 161 women, aged 35-60.  Lung function (FEV1) was regressed on age.  There is a plot at:

      http://www.biostat.umn.edu/~john-c/5421/n54703.004.3.grf

which indicates the relationship of FEV1 and age.

========================================================================

options linesize = 80 ;

PROC REG DATA = SMOKE ;
  WHERE NCASE GE 600 AND NCASE LE 1000 AND GENDER EQ 1 ;
  MODEL S2FEVPRE = AGE ;
  OUTPUT OUT = REGOUT R = RESFEV1 P = PDFEV1 DFFITS = DFFFEV1 ;
  TITLE1 'REGRESSION OF FEV1 ON AGE - WOMEN ONLY' ;

PROC SORT DATA = REGOUT ;
  BY RESFEV1 ;

PROC PRINT DATA = REGOUT ;
  VAR NCASE AGE S2FEVPRE PDFEV1 RESFEV1 DFFFEV1 ;
  TITLE1 'SOME REGRESSION DIAGNOSTICS ... FEV1 VS AGE' ;

PROC UNIVARIATE DATA = REGOUT PLOT NORMAL ;
  VAR RESFEV1 ;
  TITLE1 'PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE' ;

proc loess data = regout ;
  model s2fevpre = age / smooth = .2 ;
  ods output OutputStatistics = Results ;

proc sort data = Results ;
  by age ;

proc print data = Results ;
  var age DepVar Pred ;

symbol1 c = black v = dot w = .7 h = .7 ;
symbol2 c = black i = j v = none l = 1 w = 1 ;

GOPTIONS DEVICE = GIF ;

PROC GPLOT DATA = Results ;
  PLOT DepVar * AGE Pred * AGE / OVERLAY HAXIS = AXIS1 ;
  AXIS1 ORDER = 34 TO 62 BY 4 ;
  TITLE1 H = 2 'PLOT OF FEV1 VERSUS AGE: WOMEN ONLY' ;
RUN ;

PROC SORT DATA = REGOUT ;
  BY AGE ;

ENDSAS ;

------------------------------------------------------------------------

              REGRESSION OF FEV1 ON AGE - WOMEN ONLY                    1
                                        18:57 Saturday, January 31, 2004

                           The REG Procedure
                             Model: MODEL1
                     Dependent Variable: S2FEVPRE

                          Analysis of Variance

                                  Sum of        Mean
Source                 DF        Squares      Square    F Value    Pr > F

Model                   1        3.03831     3.03831      39.34    <.0001
Error                 159       12.27991     0.07723
Corrected Total       160       15.31822

Root MSE            0.27791    R-Square    0.1983
Dependent Mean      2.07478    Adj R-Sq    0.1933
Coeff Var          13.39450

                          Parameter Estimates

                                       Parameter    Standard
Variable   Label                  DF    Estimate       Error   t Value   Pr > |t|

Intercept  Intercept               1     3.10145     0.16515     18.78     <.0001
age        AGE AT ENTRY INTO LHS   1    -0.02102     0.00335     -6.27     <.0001

LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 18:57

SOME REGRESSION DIAGNOSTICS ...
FEV1 VS AGE 2 18:57 Saturday, January 31, 2004 Obs NCASE age S2FEVPRE PDFEV1 RESFEV1 DFFFEV1 1 734 56 1.30 1.92439 -0.62439 -0.26972 2 719 50 1.49 2.05050 -0.56050 -0.16407 3 687 53 1.43 1.98744 -0.55744 -0.19083 ------------------------------ observations omitted ------------------- 158 703 59 2.32 1.86133 0.45867 0.24696 159 810 52 2.48 2.00846 0.47154 0.15055 160 997 54 2.51 1.96642 0.54358 0.20016 161 995 43 2.77 2.19763 0.57237 0.22255 LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 18:57 PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE 6 18:57 Saturday, January 31, 2004 The UNIVARIATE Procedure Variable: RESFEV1 (Residual) Moments N 161 Sum Weights 161 Mean 0 Sum Observations 0 Std Deviation 0.27703684 Variance 0.07674941 Skewness -0.2361922 Kurtosis -0.8066874 Uncorrected SS 12.2799057 Corrected SS 12.2799057 Coeff Variation . Std Error Mean 0.02183356 Basic Statistical Measures Location Variability Mean 0.00000 Std Deviation 0.27704 Median 0.02561 Variance 0.07675 Mode -0.25356 Range 1.19675 Interquartile Range 0.42306 NOTE: The mode displayed is the smallest of 6 modes with a count of 2. Tests for Location: Mu0=0 Test -Statistic- -----p Value------ Student's t t 0 Pr > |t| 1.0000 Sign M 7.5 Pr >= |M| 0.2698 Signed Rank S 144.5 Pr >= |S| 0.8082 Tests for Normality Test --Statistic--- -----p Value------ Shapiro-Wilk W 0.975985 Pr < W 0.0066 Kolmogorov-Smirnov D 0.066321 Pr > D 0.0827 Cramer-von Mises W-Sq 0.177031 Pr > W-Sq 0.0102 Anderson-Darling A-Sq 1.121213 Pr > A-Sq 0.0063 Quantiles (Definition 5) Quantile Estimate 100% Max 0.5723670 99% 0.5435764 95% 0.4223670 90% 0.3405193 75% Q3 0.2105193 50% Median 0.0256145 25% Q1 -0.2125378 LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 18:57 PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE 7 18:57 Saturday, January 31, 2004 The UNIVARIATE Procedure Variable: RESFEV1 (Residual) Quantiles (Definition 5) Quantile Estimate 10% -0.4055950 5% -0.4706902 1% -0.5604997 0% Min -0.6243855 Extreme Observations ------Lowest------ ------Highest----- Value Obs Value Obs -0.624385 1 0.443576 157 -0.560500 2 0.458672 158 -0.557443 3 0.471538 159 -0.527633 4 0.543576 160 -0.525595 5 0.572367 161 Stem Leaf # Boxplot 5 7 1 | 5 4 1 | 4 67 2 | 4 2223334 7 | 3 67999 5 | 3 0001234 7 | 2 567777799 9 | 2 0001113333444 13 +-----+ 1 7888999 7 | | 1 012223333344444 15 | | 0 56677888899 11 | | 0 1111222334 10 *--+--* -0 444322211 9 | | -0 9987765 7 | | -1 444211 6 | | -1 99666655 8 | | -2 3222100 7 +-----+ -2 7765555 7 | -3 44332 5 | -3 99776 5 | -4 4221100 7 | -4 8877655 7 | -5 33 2 | -5 66 2 | -6 2 1 | ----+----+----+----+ Multiply Stem.Leaf by 10**-1 LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 18:57 PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE 8 18:57 Saturday, January 31, 2004 The UNIVARIATE Procedure Variable: RESFEV1 (Residual) Normal Probability Plot 0.575+ ++ * | + * | ++ ** | ***** | *** | +** | *** | ***+ | **+ | ***+ | **+ | **+ -0.025+ *** | **+ | ** | +** | +*** | +** | ++** | ++** | +*** | **** | ** | * *+ -0.625+* ++ +----+----+----+----+----+----+----+----+----+----+ -2 -1 0 +1 +2 LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 18:57 ======================================================================== Note that the printout of proc univariate identifies the inter- quartile range and the median of the residuals: they are IQR = .42306 Median = .02561. Therefore outliers in the residuals are those with values either: above Median + 1.5*IQR = .6602, or below Median - 1.5*IQR = -.6090. 
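Steps 3 and 4 of the outlier check can also be automated.  Below is a minimal sketch, assuming the output dataset REGOUT from the Lung Health Study program above; the dataset names 'quart' and 'flagged' and the flag variable 'outlier' are chosen for illustration.

      proc means data = regout noprint ;
         var resfev1 ;
         output out = quart median = med q1 = q1 q3 = q3 qrange = iqr ;
      run ;

      data flagged ;
         if _n_ = 1 then set quart ;    * bring in the one-row summary ;
         set regout ;
         * Flag residuals more than 1.5*IQR above or below the median ;
         outlier = ( resfev1 > med + 1.5*iqr ) or ( resfev1 < med - 1.5*iqr ) ;
      run ;

      proc print data = flagged ;
         where outlier ;
         var ncase age s2fevpre pdfev1 resfev1 ;
         title1 'Residual outliers: more than 1.5*IQR from the median' ;
      run ;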
From the printout above, as it happens, there is only one observation which has a residual that would be considered an outlier:

      Obs    NCASE    age    S2FEVPRE     PDFEV1     RESFEV1     DFFFEV1

        1      734     56        1.30    1.92439    -0.62439    -0.26972

See if you can locate this observation on the graph below!

==================================================================================

          PLOT OF FEV1 VERSUS AGE: 161 LUNG HEALTH STUDY WOMEN
                                        19:46 Saturday, January 31, 2004

      Plot of DepVar*age.  Legend: A = 1 obs, B = 2 obs, etc.

      [Line-printer scatter plot of FEV1 (DepVar, roughly 1.0 to 3.0 liters)
       versus AGE AT ENTRY INTO LHS (35 to 60 years), showing the downward
       trend of FEV1 with age.]

      LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 19:46

==================================================================================

The actions that can be taken for outliers in residuals are the following:

1. Check the data for errors if possible.

2. Re-run the regression excluding the outliers and see whether it makes any substantial difference in the F-test or the R-square or the T-tests for the coefficients.

==================================================================================

Problem 1:

Using the data on mortality and water hardness in Chapter 2 of the Der-Everitt text, assume the model

      mortality = a + b*hardness + e,

where the error term e has a normal distribution with mean 0 and standard deviation sigma.  Use SAS and proc reg to analyze this data.

(1) Plot mortality versus water hardness.
(2) Compute predicted mortality and residuals, and give a plot of residuals versus hardness.
(3) Test for influential points and outliers.
(4) What does the F-test tell you?
(5) Find 95% confidence intervals for the coefficient estimates.
(6) What proportion of the variability in mortality is accounted for by water hardness?
(7) What is the regression estimate of sigma?

n54703.004

Last update: February 7, 2005.