PROC REG, I : Linear Regression and related topics. n54703.004
Proc reg is one of the oldest and most used of the SAS procedures,
and it has an enormous number of options. There is a lot of overlap between
proc reg and proc glm. In general, proc glm does more, but there are some
things that proc reg will do that proc glm will not.
The object in regression is to investigate the relationship between
a quantitative outcome variable Y and one or more predictors. The simplest
case is the one-dimensional model
[1] Y = a + b*X + e,
where Y is the outcome variable, X is the 'predictor', and 'a' and 'b' are
unknown coefficients. The term "e" represents "measurement error".
Typically one assumes that e has a normal distribution N(0, sigma^2),
where the mean error is 0, and sigma represents the standard deviation
of measurement. Typically sigma is an unknown positive number. If you
fixed X at a certain value and measured Y over and over again, you would
expect the standard deviation of the resulting Y's to be close to sigma.
Note that sigma^2 is the *variance* of Y.
Proc reg and most other regression procedures produce estimates of
a, b, and sigma, and a number of other related statistics. In the case
of model [1] above, what regression does essentially is find the straight
line which best fits the data. Straight lines for a 1-dimensional model
like this are determined by their intercept (a) and their slope (b).
They also produce an estimate of sigma, the standard deviation of the
measurement error. This estimate is usually called "s". You need at least
two points of observed data to estimate a and b. To obtain a valid estimate
of sigma, there must be at least three observed data points. In general
the number of data points will be considerably higher than 3.
Proc reg prints coefficient estimates, their standard errors, an
"analysis of variance" (or "anova") table, and other statistics.
Here is an example, with the lines numbered for reference later:
==================================================================================
1 options linesize = 80 ;
2 footnote "~john-c/5421/n54703.004.sas &sysdate &systime" ;
3
4 data class ;
5 input name $ height weight age @@ ;
6 datalines ;
7 Alfred 69.0 112.5 14 Alice 56.5 84.0 13 Barbara 65.3 98.0 13
8 Carol 62.8 102.5 14 Henry 63.5 102.5 14 James 57.3 83.0 12
9 Jane 59.8 84.5 12 Janet 62.5 112.5 15 Jeffrey 62.5 84.0 13
10 John 59.0 99.5 12 Joyce 51.3 50.5 11 Judy 64.3 90.0 14
11 Louise 56.3 77.0 12 Mary 66.5 112.0 15 Philip 72.0 150.0 16
12 Robert 64.8 128.0 12 Ronald 67.0 133.0 15 Thomas 57.5 85.0 11
13 William 66.5 112.0 15
14 ;
15
16 run ;
17
18 options pagesize = 35 ;
19
20 proc plot data = class ;
21 plot weight*height ;
22 title1 "Plot of weight vs height in 19 schoolchildren" ;
23 run ;
24
25 proc reg data = class ;
26 model weight = height ;
27 output out = regout residual = wresid rstudent = wstudent
28 dffits = wdffits predicted = wpred ;
29 title1 "Simple linear regression: Weight vs. Height in schoolchildren" ;
30 run ;
31
32 proc plot data = regout ;
33 plot wresid * height ;
34 title1 "Plot of regression residuals vs. height" ;
35 run ;
36
37 proc print data = regout ;
38 var age height weight wpred wresid wstudent wdffits ;
39 title1
40 "Printout of diagnostics: predicted, residuals, student resids, dffits" ;
41 run ;
42
43-----------------------------------------------------------------------------
44
45 Plot of weight vs height in 19 schoolchildren 1
46 17:46 Friday, January 30, 2004
47
48 Plot of WEIGHT*HEIGHT. Legend: A = 1 obs, B = 2 obs, etc.
49
50 WEIGHT |
51 150 + A
52 |
53 |
54 | A
55 | A
56 125 +
57 |
58 | A A
59 | B
60 | A A
61 100 + A A
62 |
63 | A
64 | A AA A A
65 |
66 75 + A
67 |
68 |
69 |
70 |
71 50 + A
72 -+-------------+-------------+-------------+-------------+-------------+-
73 50 55 60 65 70 75
74
75 HEIGHT
76
77
78
79 ~john-c/5421/n54703.004.sas 30JAN04 17:46
80 Simple linear regression: Weight vs. Height in schoolchildren 2
81 17:46 Friday, January 30, 2004
82
83 Model: MODEL1
84 Dependent Variable: WEIGHT
85
86 Analysis of Variance
87
88 Sum of Mean
89 Source DF Squares Square F Value Prob>F
90
91 Model 1 7193.24912 7193.24912 57.076 0.0001
92 Error 17 2142.48772 126.02869
93 C Total 18 9335.73684
94
95 Root MSE 11.22625 R-square 0.7705
96 Dep Mean 100.02632 Adj R-sq 0.7570
97 C.V. 11.22330
98
99 Parameter Estimates
100
101 Parameter Standard T for H0:
102 Variable DF Estimate Error Parameter=0 Prob > |T|
103
104 INTERCEP 1 -143.026918 32.27459130 -4.432 0.0004
105 HEIGHT 1 3.899030 0.51609395 7.555 0.0001
106
107
108
109 ~john-c/5421/n54703.004.sas 30JAN04 17:46
110 Plot of regression residuals vs. height 3
111 17:46 Friday, January 30, 2004
112
113 Plot of WRESID*HEIGHT. Legend: A = 1 obs, B = 2 obs, etc.
114
115 |
116 20 +
117 | A
118 |
119 | A
120 | A A A
121 10 +
122 R |
123 e | A
124 s | A
125 i | A
126 d 0 + A A
127 u | A
128 a | B
129 l | A A
130 |
131 -10 +
132 |
133 | A A
134 | A
135 | A
136 -20 +
137 --+-------------+-------------+-------------+-------------+-------------+-
138 50 55 60 65 70 75
139
140 HEIGHT
141
142
143
144 ~john-c/5421/n54703.004.sas 30JAN04 17:46
145 Printout of diagnostics: predicted, residuals, student resids, dffits 4
146 17:46 Friday, January 30, 2004
147
148 OBS AGE HEIGHT WEIGHT WPRED WRESID WSTUDENT WDFFITS
149
150 1 14 69.0 112.5 126.006 -13.5062 -1.33150 -0.55156
151 2 13 56.5 84.0 77.268 6.7317 0.62942 0.23750
152 3 13 65.3 98.0 111.580 -13.5798 -1.27834 -0.35390
153 4 14 62.8 102.5 101.832 0.6678 0.05931 0.01404
154 5 14 63.5 102.5 104.562 -2.0615 -0.18350 -0.04448
155 6 12 57.3 83.0 80.388 2.6125 0.23923 0.08249
156 7 12 59.8 84.5 90.135 -5.6351 -0.50799 -0.13529
157 8 15 62.5 112.5 100.662 11.8375 1.08931 0.25690
158 9 13 62.5 84.0 100.662 -16.6625 -1.59234 -0.37553
159 10 12 59.0 99.5 87.016 12.4841 1.16942 0.33577
160 11 11 51.3 50.5 56.993 -6.4933 -0.68541 -0.45950
161 12 14 64.3 90.0 107.681 -17.6807 -1.71545 -0.43638
162 13 12 56.3 77.0 76.488 0.5115 0.04739 0.01829
163 14 15 66.5 112.0 116.259 -4.2586 -0.38743 -0.12129
164 15 16 72.0 150.0 137.703 12.2967 1.28918 0.74426
165 16 12 64.8 128.0 109.630 18.3698 1.80087 0.47660
166 17 15 67.0 133.0 118.208 14.7919 1.42979 0.47285
167 18 11 57.5 85.0 81.167 3.8327 0.35087 0.11830
168 19 15 66.5 112.0 116.259 -4.2586 -0.38743 -0.12129
169
170
171 ~john-c/5421/n54703.004.sas 30JAN04 17:46
===============================================================================
Notes on the program:
Line 5: note that the input statement is designed to read in multiple
records per line of the 'datalines'. Again, note the '@@'
at the end of line 5, indicating that data on the current
continues to be read until the end of the line is encountered.
Line 20: As always, it is a good idea to plot the data BEFORE you carry
out an analysis. This may tell you that a simple linear model
is not appropriate.
Lines 25-28: In this case proc reg is used to study the relationship of
weight to height in schoolchildren aged 11 to 16. The model
statement tells you that the model is:
weight = a + b*height + e,
where e represents variability in weight that is not accounted
for by height, or "measurement error". Measurement error is not
really the right description here. Most of the deviation away
from the linear trend is due to interperson variability in this
case rather than to measurement error: to put it crudely, some kids
are thin and others are fat. Height is an important predictor of
weight, but clearly there is variability in body shape and other
factors that affect weight also.
Lines 27-28: The output statement is used to study some regression
diagnostics which will be printed later. The output statement
creates another SAS dataset, in this case called "regout"
which includes:
1. The original variables, height and weight
2. residuals: residuals are the difference between the
observed weights and the weights which are predicted
by the regresssion equation.
3. studentized residuals, wstudent. These are used to
judge whether some observations have an unusually
large effect on the regression parameters. If the
studentized residual for a given observation is
greater than 2 in absolute value, that observation
should be checked to be sure that the data are
correct. In some cases it may be advisable to re-run
the regression with such observations deleted.
4. dffits statistic: this is another method for determining
the influence of a single observation. It is computed
by doing the regression with that observation omitted,
and seeing how much difference that makes in the
parameter estimates. Again if dffits is greater than
2 in absolute value, it suggest that this is an unduly
influential observation.
5. predicted value, wpred: in this case, this is the
predicted value of weight given height based on the
coefficients; that is,
pred weight = a# + b# * height,
where a# and b# are the regression estimates of
the intercept and slope respectively.
Note that the residual at a given value of height is:
resid = pred weight - observed weight.
Line 32: proc plot data = regout: This plot is based on the output dataset
from proc reg.
Line 33: plot of wresid versus height. The residual plot can be very useful
for detecting systematic deviations away from the hypothesized
model, or for detecting dependence between the size of the residuals
and values of predictors. Seeing patterns in the residuals may cause
you to add terms to the model or to transform the "Y" variable.
The set of residuals adds up to zero. Residuals are
sometimes plotted against the predicting variable (height,
in this case), which is useful for determining whether other
terms need to be added to the model, or against the predicted
value, which may be useful in determining whether a trans-
mation of the outcome variable is needed.
------------------------------------------------------------------------
Notes on Printout:
Lines 43-79: Proc plot to show the relationship of weight to
height: as always, it is a good idea to graph the data
and use descriptive stats before you carry out parameter
estimation and testing. In this case, one sees an obvious
increasing trend in weight with height. It appears to
be approximately linear and there are no obvious outliers
or highly influential points. The linear model which will
be used in proc reg,
model weight = height <---> weight = a + b*height + e
seems reasonable. The dataset is small (19 observations).
Lines 86-93: The 'Analysis of Variance' table for a least-squares
regression like this one is based on three sums of squares:
1. Model sum of squares (also called Regression SS):
Sum of (pred y - overall mean y)^2
across all the observations.
2. Error sum of squares (also called Residual SS):
Sum of (observed y - pred y)^2
across all the observations.
3. 'Corrected' total sum of squares ("C total" SS ):
Sum of (observed y - overall mean y)^2.
The first two of these add up to the third:
Regr SS + Error SS = C total SS
If Regr SS is large and Error SS is small, then the
regression is accounting for most of the variability.
4. The Degrees of Freedom (DF) for these sums of squares
are:
Regr SS DF = (no. of params - 1) = 2 - 1 = 1.
C total SS DF = (no. of obs - 1) = 19 - 1 = 18.
Error SS DF = (no. of obs - no. of params) = 19 - 2 = 17.
5. The mean squares are the sums of squares divided by
their associated degrees of freedom: for example,
Error Mean Square = 2142.48772 / 17 = 126.02689, etc.
6. The F-statistic is the quotient of the Regr MS by
the Error MS:
Regr SS / (p - 1) 7193.24912
F= ------------------ = ---------- = 57.06.
Error SS / (n - p) 126.02869
7. The p-value is the probability of seeing an F-value
as large as, or larger than, the observed, if
the null hypothesis is true.
The null hypothesis being tested here is that there
is no real relationship between weight and height.
That translates into the statement that the regression
coefficient b (i.e., slope) is zero:
H0: b = 0.
Here the p-value is < 0.0001. One would reject the
null hypothesis and conclude that weight is related
to height.
Line 95: Root MSE is the square root of the Error Mean Square:
that is, in this case,
Root MSE = sqrt(126.02869) = 11.22625.
This is the regression estimate of the standard deviation
sigma of the "measurement error". This is often called
"s". Another way to write it is as:
sqrt[(Error SS)/(n - p)].
Line 95: R-square is the quotient of (Regr SS) / (C Total SS): here,
R-square = 7193.24912 / 9335.73684 = 0.7705.
Note that R-square is always less than or equal to 1.
R-square will equal 1 only when the Error SS is zero, which
means that all the observations lie exactly on a straight
line. In general, it is often said that
"R-square is the proportion of the variability in
the outcome variable that is accounted for by the
regression."
In this case, about 77%.
Lines 104-105: Here are shown the parameter estimates, a# = intercept
and b# = slope, and other statistics. In this case the
prediction equation for weight as a function of height is:
weight = -143.027 + 3.899*height,
which means that if height increases by 1 inch, you would
expect weight to increase by 3.899 pounds.
The Standard Errors of the coefficients are given also.
These standard errors will be large if there is a lot
of variability around the regression line (large Error SS)
and small if the observed data points are very close
to the regression line (small Error SS).
T for H0: parameter = 0: This t-statistic is computed
as the quotient of the parameter estimate divided by its
standard error. In this case, it is compared to a
t-distribution with 1 degree of freedom. If the t-statistic
is large in absolute value, one rejects the null hypothesis.
The null hypothesis is that the true parameter is zero.
Prob > |T|: this is the probability that you would observe
a t-statistic with absolute value as large as, or larger
than, the observed value, if in fact the null hypothesis
were true. In this case one rejects both of the null
hypotheses:
Intercept 'a' is zero: p-value = 0.0004.
Slope 'b' is zero : p-value < 0.0001.
Notice that the latter test is identical to the F-test
in the ANOVA table. It is not a coincidence that
T (slope) = 7.555 = sqrt(F-statistic) = sqrt(57.076).
Lines 113-144: Plot of residuals versus predictor (height): Note that
residual = pred y - obsd y.
It is a fact that the mean of the residuals is always zero.
Therefore, in residual plots, one expects them to be
about equally scattered above and below the horizontal line
through 0 on the vertical axis.
In this case there is no obvious pattern or outliers among
the residuals. An outlier would represent an observation
which is far off of the regression line. There however a
formal definition of outlier which will be discussed below.
Lines 148-168: Printout of observations, predicted values, and statistics related to
influence diagnostics. Note that all of these are on the
output dataset 'regout' produced by proc reg in lines 27-28.
Note here that the predicted values for weight (WPRED) are as
given above,
weight = -143.027 + 3.899*height,
The residuals are the difference between the predicted weight
and the observed weight.
The variable 'WSTUDENT' is the 'Studentized residual'. This
provides a kind of normalized measure of how far the observed
weight differs from the predicted weight. As noted above,
if this is > 2 in absolute value, it indicates that the
point is an (perhaps unduly) influential point. In this
case the largest absolute value of WSTUDENT is 1.80087.
The variable WDFFITS is another measure of influence, and
again, one expects that it is less that 2 in absolute
value. The largest absolute values of WDFFITS here
is 0.74426.
As noted above, if WSTUDENT or WDFFITS had included some values
that were large in absolute value, there are two actions you would
take: (1) check the data to be sure there are no errors, and (2)
consider re-running the regression with exclusion of the influential
points, and see what kind of difference it makes in the F-statistic,
the R-square, and the parameter estimates. You can perform this
second regression using the dataset 'regout' as follows:
proc reg data = regout ;
model weight = height ;
where abs(wstudent) le 2.00 and abs(dffits) le 2.00 ;
title1 'Regression of weight on height, excluding influential points' ;
In this case, the regression would not change because the criteria in
the "where" statement are never met.
------------------------------------------------------------------------
Data Transformations:
One may want to transform either the outcome variable (Y) or the
predictor (X). In general the reasons for performing transformations
are the following:
1. To change what appears to be a curvilinear relationship into
a linear one.
2. To attain "homoscedasticity": "Homoscedasticity" means that
the measurement error has the same variability regardless of
the values of the outcome variable and the predictor. It often
happens in medical or biological studies that the variability
of a measured outcome is proportional to the size of that
outcome.
An example is FEV1, a measure of lung function, defined at the
volume of air in liters that a person can blow out in 1 second
when making a maximal effort. A large healthy person might
have an FEV1 of 5.00 liters, while a small, sickly or aged
person's FEV1 might be 1.5 liters. The measurement of FEV1
- unlike, say, weight or height - has a lot of random
variability. That is, two consecutive measurements of
FEV1 for the same person may easily differ by as much 10%.
For the large person with FEV1 = 5.000, this would amount
to saying sigma = .1*5.00 = 0.50 liters. For the small person
with FEV1 = 1.50, this yields sigma = .1*1.50 = 0.15 liters.
ANOVA tests and t-tests like those carried out in proc reg,
are based on the assumption of homoscedasticity. If
homoscedasticity is badly violated, these tests may give
misleading results.
Frequently investigators try to eliminate homoscedasticity
by transforming the outcome variable with a log transformation.
One might replace FEV1 by log(FEV1), and perform the
regression with this outcome variable rather than the original.
A way to detect non-homoscedasticity is to plot residuals
for the outcome variable against the predicted values.
See below for examples. Another way is to use options
in proc reg called SPEC or ACOV on the model statement.
Again see below for an example.
========================================================================
Transformations
The following graph illustrates the problem of heteroscedasticity:
http://www.biostat.umn.edu/~john-c/5421/n54703.004.2.grf1
Here the variability in the dependent variable clearly increases
as the dependent variable itself increases. The plot referenced above
shows an approximately linear relationship between the dependent
variable Y as a function of the independent variable X. The smooth
curve superimposed on the data was generated by a SAS smoothing procedure
called proc loess.
The next graph, at
http://www.biostat.umn.edu/~john-c/5421/n54703.004.2.grf2
shows the same data, but the dependent variable Y has been log-
transformed and replaced by log(Y). The X-variable is unchanged.
As you can see, the variability is somewhat "tamed down".
This, however, has come at a price. The formerly linear relationship
between Y and X does not look so linear when Y is replaced by
log(Y). Further, some of the observations are far below the
smoothed curve that was fit to the data. Transformations can help
in some cases, but in this case the log-transformation may have
done more harm than good.
========================================================================
Outliers:
It is easy to confuse outliers and influential points because
it will often happen that they are the same points. An influential
point has the following characteristic: if the regression is done
with the influential point omitted, there is a resulting large change
in the regression estimates of the parameters. Thus in most cases an
influential point will not be close to the regression line determined
by all the other points. When the influential point is included in the
regression, that regression line changes substantially. This is what
the influence diagnostic DFFITS measures.
An outlier also will not be close to the regression line determined
by the other points, but if it is included, it need not have
much effect on the regression line.
A point can be both an influential point and an outlier.
The effect of including outliers in a regression is to inflate
the estimate of Root MSE, or "s". This in turn increases the
standard errors of the parameter estimates, decreases the F-statistic
and the R-square, and decreases the power of the regression procedure
to detect relationships of the outcome variable to the independent
variables.
Outliers in regression are defined as follows. Let S be the
set of residuals from a regression. Suppose you constructed a
boxplot for the set S. The distance between the 25th percentile
and the 75th percentile in the boxplot is called the *interquartile
range*, or IQR. A value which is more than 1.5*IQR above the median
is called an outlier. Similarly a value which are more than 1.5*IQR
units below the median is called an outlier.
This leads to the following procedure to detect outliers in
regression:
1. Regress Y on the independent variable(s).
2. Output the residuals to a dataset.
3. Find the 25th percentile, the median, and the 75th
percentile (using proc univariate) of the residuals.
4. Find which points correspond to residuals which are either:
(i) above median + 1.5*IQR, or
(ii) below median - 1.5*IQR.
Below is a program which shows how this works for data from the
Lung Health Study. The participants were 161 women, aged 35-60.
Lung function (FEV1) was regressed on age. There is a plot at:
http://www.biostat.umn.edu/~john-c/5421/n54703.004.3.grf
which indicates the relationship of FEV1 and age.
========================================================================
options linesize = 80 ;
PROC REG DATA = SMOKE ;
WHERE NCASE GE 600 AND NCASE LE 1000 AND GENDER EQ 1 ;
MODEL S2FEVPRE = AGE ;
OUTPUT OUT = REGOUT R = RESFEV1 P = PDFEV1 DFFITS = DFFFEV1 ;
TITLE1 'REGRESSION OF FEV1 ON AGE - WOMEN ONLY' ;
PROC SORT DATA = REGOUT ; BY RESFEV1 ;
PROC PRINT DATA = REGOUT ;
VAR NCASE AGE S2FEVPRE PDFEV1 RESFEV1 DFFFEV1 ;
TITLE1 'SOME REGRESSION DIAGNOSTICS ... FEV1 VS AGE' ;
PROC UNIVARIATE DATA = REGOUT PLOT NORMAL ;
VAR RESFEV1 ;
TITLE1 'PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE' ;
proc loess data = regout ;
model s2fevpre = age / smooth = .2 ;
ods output OutputStatistics = Results ;
proc sort data = Results ; by age ;
proc print data = Results ;
var age DepVar Pred ;
symbol1 c = black v = dot w = .7 h = .7 ;
symbol2 c = black i = j v = none l = 1 w = 1 ;
GOPTIONS DEVICE = GIF ;
PROC GPLOT DATA = Results ;
PLOT DepVar * AGE Pred * AGE/ OVERLAY HAXIS = AXIS1 ;
AXIS1 ORDER = 34 TO 62 BY 4 ;
TITLE1 H = 2 'PLOT OF FEV1 VERSUS AGE: WOMEN ONLY' ;
RUN ;
PROC SORT DATA = REGOUT ; BY AGE ;
ENDSAS ;
------------------------------------------------------------------------
REGRESSION OF FEV1 ON AGE - WOMEN ONLY 1
18:57 Saturday, January 31, 2004
The REG Procedure
Model: MODEL1
Dependent Variable: S2FEVPRE
Analysis of Variance
Sum of Mean
Source DF Squares Square F Value Pr > F
Model 1 3.03831 3.03831 39.34 <.0001
Error 159 12.27991 0.07723
Corrected Total 160 15.31822
Root MSE 0.27791 R-Square 0.1983
Dependent Mean 2.07478 Adj R-Sq 0.1933
Coeff Var 13.39450
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value
Intercept Intercept 1 3.10145 0.16515 18.78
age AGE AT ENTRY INTO LHS 1 -0.02102 0.00335 -6.27
Parameter Estimates
Variable Label DF Pr > |t|
Intercept Intercept 1 <.0001
age AGE AT ENTRY INTO LHS 1 <.0001
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 18:57
SOME REGRESSION DIAGNOSTICS ... FEV1 VS AGE 2
18:57 Saturday, January 31, 2004
Obs NCASE age S2FEVPRE PDFEV1 RESFEV1 DFFFEV1
1 734 56 1.30 1.92439 -0.62439 -0.26972
2 719 50 1.49 2.05050 -0.56050 -0.16407
3 687 53 1.43 1.98744 -0.55744 -0.19083
------------------------------ observations omitted -------------------
158 703 59 2.32 1.86133 0.45867 0.24696
159 810 52 2.48 2.00846 0.47154 0.15055
160 997 54 2.51 1.96642 0.54358 0.20016
161 995 43 2.77 2.19763 0.57237 0.22255
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 18:57
PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE 6
18:57 Saturday, January 31, 2004
The UNIVARIATE Procedure
Variable: RESFEV1 (Residual)
Moments
N 161 Sum Weights 161
Mean 0 Sum Observations 0
Std Deviation 0.27703684 Variance 0.07674941
Skewness -0.2361922 Kurtosis -0.8066874
Uncorrected SS 12.2799057 Corrected SS 12.2799057
Coeff Variation . Std Error Mean 0.02183356
Basic Statistical Measures
Location Variability
Mean 0.00000 Std Deviation 0.27704
Median 0.02561 Variance 0.07675
Mode -0.25356 Range 1.19675
Interquartile Range 0.42306
NOTE: The mode displayed is the smallest of 6 modes with a count of 2.
Tests for Location: Mu0=0
Test -Statistic- -----p Value------
Student's t t 0 Pr > |t| 1.0000
Sign M 7.5 Pr >= |M| 0.2698
Signed Rank S 144.5 Pr >= |S| 0.8082
Tests for Normality
Test --Statistic--- -----p Value------
Shapiro-Wilk W 0.975985 Pr < W 0.0066
Kolmogorov-Smirnov D 0.066321 Pr > D 0.0827
Cramer-von Mises W-Sq 0.177031 Pr > W-Sq 0.0102
Anderson-Darling A-Sq 1.121213 Pr > A-Sq 0.0063
Quantiles (Definition 5)
Quantile Estimate
100% Max 0.5723670
99% 0.5435764
95% 0.4223670
90% 0.3405193
75% Q3 0.2105193
50% Median 0.0256145
25% Q1 -0.2125378
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 18:57
PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE 7
18:57 Saturday, January 31, 2004
The UNIVARIATE Procedure
Variable: RESFEV1 (Residual)
Quantiles (Definition 5)
Quantile Estimate
10% -0.4055950
5% -0.4706902
1% -0.5604997
0% Min -0.6243855
Extreme Observations
------Lowest------ ------Highest-----
Value Obs Value Obs
-0.624385 1 0.443576 157
-0.560500 2 0.458672 158
-0.557443 3 0.471538 159
-0.527633 4 0.543576 160
-0.525595 5 0.572367 161
Stem Leaf # Boxplot
5 7 1 |
5 4 1 |
4 67 2 |
4 2223334 7 |
3 67999 5 |
3 0001234 7 |
2 567777799 9 |
2 0001113333444 13 +-----+
1 7888999 7 | |
1 012223333344444 15 | |
0 56677888899 11 | |
0 1111222334 10 *--+--*
-0 444322211 9 | |
-0 9987765 7 | |
-1 444211 6 | |
-1 99666655 8 | |
-2 3222100 7 +-----+
-2 7765555 7 |
-3 44332 5 |
-3 99776 5 |
-4 4221100 7 |
-4 8877655 7 |
-5 33 2 |
-5 66 2 |
-6 2 1 |
----+----+----+----+
Multiply Stem.Leaf by 10**-1
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 18:57
PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE 8
18:57 Saturday, January 31, 2004
The UNIVARIATE Procedure
Variable: RESFEV1 (Residual)
Normal Probability Plot
0.575+ ++ *
| + *
| ++ **
| *****
| ***
| +**
| ***
| ***+
| **+
| ***+
| **+
| **+
-0.025+ ***
| **+
| **
| +**
| +***
| +**
| ++**
| ++**
| +***
| ****
| **
| * *+
-0.625+* ++
+----+----+----+----+----+----+----+----+----+----+
-2 -1 0 +1 +2
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 18:57
========================================================================
Note that the printout of proc univariate identifies the inter-
quartile range and the median of the residuals: they are
IQR = .42306 Median = .02561.
Therefore outliers in the residuals are those with values either:
above Median + 1.5*IQR = .6602, or
below Median - 1.5*IQR = -.6090.
From the printout above, as it happens, there is only one
observation which has a residual that would be considered an
outlier:
Obs NCASE age S2FEVPRE PDFEV1 RESFEV1 DFFFEV1
1 734 56 1.30 1.92439 -0.62439 -0.26972
See if you can locate this observation on the graph below!
==================================================================================
PLOT OF FEV1 VERSUS AGE: 161 LUNG HEALTH STUDY WOMEN 16
19:46 Saturday, January 31, 2004
Plot of DepVar*age. Legend: A = 1 obs, B = 2 obs, etc.
|
3.00000 +
|
|
| A
| A A A
| A A A A A A A
2.50000 + A A A A A A A
| A A A A A A
S | A A A B A A C A B A A
2 | A A A A A C A B A B B A B A
F | A B E A B B A A A B A
E | A A A B C A B C
V 2.00000 + A B B A B A A B A A
P | A A A A A B C A
R | A A B A A B A B B
E | B A B A A A A A A
| A A A B
| A B A A
1.50000 + A A A A
| A B A A
| A
|
|
|
1.00000 +
|
---+---------+---------+---------+---------+---------+--
35.00000 40.00000 45.00000 50.00000 55.00000 60.00000
AGE AT ENTRY INTO LHS
LUNG HEALTH STUDY : WBJEC5.SAS (JEC) 31JAN04 19:46
==================================================================================
The actions that can be taken for outliers in residuals are the following:
1. Check the data for errors if possible.
2. Re-run the regression excluding the outliers and see whether
it makes any substantial difference in the F-test or the
R-square or the T-tests for the coefficients.
==================================================================================
Problem 1:
Using the data on mortality and water hardness in Chapter 2 of the
Der-Everitt text, assume the model
mortality = a + b*hardness + e,
where the error term e has a normal distribution with mean 0 and standard
deviation sigma.
Use SAS and proc reg to analyze this data.
(1) Plot mortality versus water hardness.
(2) Compute predicted mortality and residuals and give a plot of residuals versus hardness
(3) Test for influential points and outliers.
(4) What does the F-test tell you?
(5) Find 95% confidence intervals for the coefficient estimates.
(6) What proportion of the variability in mortality is accounted for
by water hardness?
(7) What is the regression estimate of sigma ?
n54703.004 Last update: February 7, 2005.