PROC REG, I :  Linear Regression and related topics.           n54703.004

     Proc reg is one of the oldest and most used of the SAS procedures,
and it has an enormous number of options.  There is a lot of overlap between
proc reg and proc glm.  In general, proc glm does more, but there are some
things that proc reg will do that proc glm will not.

     The object in regression is to investigate the relationship between
a quantitative outcome variable Y and one or more predictors.  The simplest
case is the one-dimensional model

[1]     Y = a + b*X + e,

where Y is the outcome variable, X is the 'predictor', and 'a' and 'b' are
unknown coefficients.  The term "e" represents "measurement error".
Typically one assumes that e has a normal distribution N(0, sigma^2),
where the mean error is 0, and sigma represents the standard deviation
of measurement.  Typically sigma is an unknown positive number.  If you
fixed X at a certain value and measured Y over and over again, you would
expect the standard deviation of the resulting Y's to be close to sigma.

     Note that sigma^2 is the *variance* of the error term e, and hence
the variance of Y at a fixed value of X.
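
     As an illustration of this point, here is a small SAS sketch that
fixes X at one value and simulates repeated measurements of Y under
model [1].  The values a = 10, b = 2, sigma = 3, and the seed are made
up for illustration:

     data sim ;
          a = 10 ; b = 2 ; sigma = 3 ;
          x = 5 ;                               * X fixed at one value ;
          do i = 1 to 1000 ;
             y = a + b*x + sigma*rannor(12345) ; * e is N(0, sigma^2) ;
             output ;
          end ;
     run ;

     proc means data = sim mean std ;
          var y ;
     title1 "Simulated Y at fixed X: std should be close to sigma = 3" ;
     run ;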

     Proc reg and most other regression procedures produce estimates of
a, b, and sigma, and a number of other related statistics.  In the case
of model [1] above, what regression does essentially is find the straight
line which best fits the data.  Straight lines for a 1-dimensional model
like this are determined by their intercept (a) and their slope (b).
Proc reg also produces an estimate of sigma, the standard deviation of
the measurement error.  This estimate is usually called "s".  You need at
least two points of observed data to estimate a and b.  To obtain a valid
estimate of sigma, there must be at least three observed data points.  In
practice the number of data points will be considerably larger than 3.

    Proc reg prints coefficient estimates, their standard errors, an 
"analysis of variance" (or "anova") table, and other statistics.
 Here is an example, with the lines numbered for reference later:

==================================================================================

1    options linesize = 80 ;
2    footnote "~john-c/5421/n54703.004.sas &sysdate &systime" ;
3
4    data class ;
5         input name $ height weight age @@ ;
6    datalines ;
7    Alfred  69.0  112.5  14   Alice  56.5  84.0  13   Barbara  65.3  98.0  13
8    Carol   62.8  102.5  14   Henry  63.5 102.5  14   James    57.3  83.0  12
9    Jane    59.8   84.5  12   Janet  62.5 112.5  15   Jeffrey  62.5  84.0  13
10   John    59.0   99.5  12   Joyce  51.3  50.5  11   Judy     64.3  90.0  14
11   Louise  56.3   77.0  12   Mary   66.5 112.0  15   Philip   72.0 150.0  16
12   Robert  64.8  128.0  12   Ronald 67.0 133.0  15   Thomas   57.5  85.0  11
13   William 66.5  112.0  15
14   ;
15
16   run ;
17
18   options pagesize = 35 ;
19
20   proc plot data = class ;
21        plot weight*height ;
22   title1 "Plot of weight vs height in 19 schoolchildren" ;
23   run ;
24
25   proc reg data = class ;
26        model weight = height ;
27        output out = regout residual = wresid  rstudent = wstudent
28                     dffits = wdffits  predicted = wpred ;
29   title1 "Simple linear regression: Weight vs. Height in schoolchildren" ;
30   run ;
31
32   proc plot data = regout ;
33        plot wresid * height ;
34   title1 "Plot of regression residuals vs. height" ;
35   run ;
36
37   proc print data = regout ;
38        var age height weight wpred wresid wstudent wdffits ;
39   title1
40    "Printout of diagnostics: predicted, residuals, student resids, dffits" ;
41   run ;
42
43-----------------------------------------------------------------------------
44
45                    Plot of weight vs height in 19 schoolchildren                 1
46                                                     17:46 Friday, January 30, 2004
47
48              Plot of WEIGHT*HEIGHT.  Legend: A = 1 obs, B = 2 obs, etc.
49
50   WEIGHT |
51      150 +                                                              A
52          |
53          |
54          |                                                A
55          |                                         A
56      125 +
57          |
58          |                                   A                 A
59          |                                              B
60          |                                    A A
61      100 +                         A                 A
62          |
63          |                                        A
64          |                  A AA     A       A
65          |
66       75 +                  A
67          |
68          |
69          |
70          |
71       50 +    A
72          -+-------------+-------------+-------------+-------------+-------------+-
73          50            55            60            65            70            75
74
75                                            HEIGHT
76
77
78
79                      ~john-c/5421/n54703.004.sas 30JAN04 17:46
80             Simple linear regression: Weight vs. Height in schoolchildren         2
81                                                     17:46 Friday, January 30, 2004
82
83   Model: MODEL1
84   Dependent Variable: WEIGHT
85
86                                 Analysis of Variance
87
88                                    Sum of         Mean
89           Source          DF      Squares       Square      F Value       Prob>F
90
91           Model            1   7193.24912   7193.24912       57.076       0.0001
92           Error           17   2142.48772    126.02869
93           C Total         18   9335.73684
94
95               Root MSE      11.22625     R-square       0.7705
96               Dep Mean     100.02632     Adj R-sq       0.7570
97               C.V.          11.22330
98
99                                 Parameter Estimates
100
101                         Parameter      Standard    T for H0:
102        Variable  DF      Estimate         Error   Parameter=0    Prob > |T|
103
104        INTERCEP   1   -143.026918   32.27459130        -4.432        0.0004
105        HEIGHT     1      3.899030    0.51609395         7.555        0.0001
106
107
108
109                     ~john-c/5421/n54703.004.sas 30JAN04 17:46
110                       Plot of regression residuals vs. height                    3
111                                                    17:46 Friday, January 30, 2004
112
113             Plot of WRESID*HEIGHT.  Legend: A = 1 obs, B = 2 obs, etc.
114
115        |
116     20 +
117        |                                          A
118        |
119        |                                                 A
120         |                          A         A                          A
121      10 +
122  R     |
123  e     |                   A
124  s     |                      A
125  i     |                     A
126  d   0 +                   A                 A
127  u     |                                       A
128  a     |                                               B
129  l     |     A                      A
130        |
131    -10 +
132        |
133        |                                            A         A
134        |                                    A
135        |                                         A
136    -20 +
137        --+-------------+-------------+-------------+-------------+-------------+-
138         50            55            60            65            70            75
139
140                                          HEIGHT
141
142
143
144                     ~john-c/5421/n54703.004.sas 30JAN04 17:46
145        Printout of diagnostics: predicted, residuals, student resids, dffits     4
146                                                    17:46 Friday, January 30, 2004
147
148   OBS    AGE    HEIGHT    WEIGHT     WPRED       WRESID    WSTUDENT     WDFFITS
149
150     1     14     69.0      112.5    126.006    -13.5062    -1.33150    -0.55156
151     2     13     56.5       84.0     77.268      6.7317     0.62942     0.23750
152     3     13     65.3       98.0    111.580    -13.5798    -1.27834    -0.35390
153     4     14     62.8      102.5    101.832      0.6678     0.05931     0.01404
154     5     14     63.5      102.5    104.562     -2.0615    -0.18350    -0.04448
155     6     12     57.3       83.0     80.388      2.6125     0.23923     0.08249
156     7     12     59.8       84.5     90.135     -5.6351    -0.50799    -0.13529
157     8     15     62.5      112.5    100.662     11.8375     1.08931     0.25690
158     9     13     62.5       84.0    100.662    -16.6625    -1.59234    -0.37553
159    10     12     59.0       99.5     87.016     12.4841     1.16942     0.33577
160    11     11     51.3       50.5     56.993     -6.4933    -0.68541    -0.45950
161    12     14     64.3       90.0    107.681    -17.6807    -1.71545    -0.43638
162    13     12     56.3       77.0     76.488      0.5115     0.04739     0.01829
163    14     15     66.5      112.0    116.259     -4.2586    -0.38743    -0.12129
164    15     16     72.0      150.0    137.703     12.2967     1.28918     0.74426
165    16     12     64.8      128.0    109.630     18.3698     1.80087     0.47660
166    17     15     67.0      133.0    118.208     14.7919     1.42979     0.47285
167    18     11     57.5       85.0     81.167      3.8327     0.35087     0.11830
168    19     15     66.5      112.0    116.259     -4.2586    -0.38743    -0.12129
169
170
171                    ~john-c/5421/n54703.004.sas 30JAN04 17:46

===============================================================================

Notes on the program:

Line 5:   note that the input statement is designed to read multiple
          records per line of the 'datalines'.  Note the '@@' at the
          end of line 5: it tells SAS to hold the current line and
          continue reading observations from it until the end of the
          line is reached.
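
          As a small illustration (hypothetical data), the following
          data step reads three observations (x, y) from the first
          line of datalines and two from the second; without the
          '@@', SAS would read only one observation per line:

             data demo ;
                  input x y @@ ;
             datalines ;
             1 2  3 4  5 6
             7 8  9 10
             ;
             run ;

          The dataset 'demo' ends up with five observations.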

Line 20:  As always, it is a good idea to plot the data BEFORE you carry
          out an analysis.  This may tell you that a simple linear model
          is not appropriate.

Lines 25-28: In this case proc reg is used to study the relationship of
          weight to height in schoolchildren aged 11 to 16.  The model
          statement tells you that the model is:

              weight = a + b*height + e,

          where e represents variability in weight that is not accounted
          for by height, or "measurement error".  Measurement error is not
          really the right description here.  Most of the deviation away
          from the linear trend is due to interperson variability in this
          case rather than to measurement error: to put it crudely, some kids 
          are thin and others are fat.  Height is an important predictor of
          weight, but clearly there is variability in body shape and other
          factors that affect weight also.

Lines 27-28: The output statement is used to study some regression
          diagnostics which will be printed later.  The output statement
          creates another SAS dataset, in this case called "regout"
          which includes:

                1.  The original variables, height and weight

                2.  residuals: residuals are the difference between the
                    observed weights and the weights which are predicted
                    by the regression equation.

                3.  studentized residuals, wstudent.  These are used to
                    judge whether some observations have an unusually
                    large effect on the regression parameters.  If the
                    studentized residual for a given observation is
                    greater than 2 in absolute value, that observation
                    should be checked to be sure that the data are
                    correct.  In some cases it may be advisable to re-run
                    the regression with such observations deleted.

                4.  dffits statistic: this is another method for determining
                    the influence of a single observation.  It is computed
                    by doing the regression with that observation omitted,
                    and seeing how much difference that makes in the
                    parameter estimates.  Again if dffits is greater than
                    2 in absolute value, it suggests that this is an unduly
                    influential observation.

                5.  predicted value, wpred: in this case, this is the
                    predicted value of weight given height based on the
                    coefficients; that is,

                       pred weight = a# + b# * height,

                    where a# and b# are the regression estimates of
                    the intercept and slope respectively.

                    Note that the residual at a given value of height is:

                       resid = observed weight - pred weight.
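
                    As a check (a sketch; the dataset name 'check' and
                    the variables pred2 and resid2 are made up), the
                    predicted values and residuals can be recomputed
                    directly from the printed coefficient estimates:

                       data check ;
                            set regout ;
                            pred2 = -143.026918 + 3.899030*height ;
                            resid2 = weight - pred2 ;
                       run ;

                       proc print data = check ;
                            var height weight wpred pred2 wresid resid2 ;
                       run ;

                    The columns pred2 and resid2 should match wpred and
                    wresid up to rounding.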


Line 32: proc plot data = regout:  This plot is based on the output dataset
         from proc reg.

Line 33: plot of wresid versus height.  The residual plot can be very useful
         for detecting systematic deviations away from the hypothesized
         model, or for detecting dependence between the size of the residuals
         and values of predictors.  Seeing patterns in the residuals may cause
         you to add terms to the model or to transform the "Y" variable.

         The set of residuals adds up to zero.  Residuals are
         sometimes plotted against the predicting variable (height,
         in this case), which is useful for determining whether other
         terms need to be added to the model, or against the predicted
          value, which may be useful in determining whether a trans-
          formation of the outcome variable is needed.
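
         A quick way to verify that the residuals sum to zero (a
         sketch, using the output dataset created in lines 27-28):

            proc means data = regout n mean sum ;
                 var wresid ;
            title1 "Residuals of weight on height: mean and sum" ;
            run ;

         Both the mean and the sum of wresid should be zero, up to
         rounding error.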

------------------------------------------------------------------------

Notes on Printout:

Lines 43-79:  Proc plot to show the relationship of weight to
         height: as always, it is a good idea to graph the data
         and use descriptive stats before you carry out parameter
         estimation and testing.  In this case, one sees an obvious
         increasing trend in weight with height.  It appears to
         be approximately linear and there are no obvious outliers
         or highly influential points.  The linear model which will
         be used in proc reg,

         model weight = height  <--->  weight = a + b*height + e

         seems reasonable.  The dataset is small (19 observations).

Lines 86-93:  The 'Analysis of Variance' table for a least-squares
         regression like this one is based on three sums of squares:

         1.  Model sum of squares (also called Regression SS):

                 Sum of (pred y - overall mean y)^2

             across all the observations.

         2.  Error sum of squares (also called Residual SS):

                 Sum of (observed y - pred y)^2

             across all the observations.

         3.  'Corrected' total sum of squares ("C total" SS ):

                 Sum of (observed y - overall mean y)^2.

         The first two of these add up to the third:

             Regr SS  +  Error SS = C total SS

         If Regr SS is large and Error SS is small, then the
         regression is accounting for most of the variability.

         4.  The Degrees of Freedom (DF) for these sums of squares
             are:

             Regr SS DF    = (no. of params - 1)          =  2 - 1 = 1.

             C total SS DF = (no. of obs - 1)             = 19 - 1 = 18.

             Error SS DF   = (no. of obs - no. of params) = 19 - 2 = 17.

         5.  The mean squares are the sums of squares divided by
             their associated degrees of freedom: for example,

              Error Mean Square = 2142.48772 / 17 = 126.02869, etc.

         6.  The F-statistic is the quotient of the Regr MS by
             the Error MS:

                   Regr SS / (p - 1)      7193.24912
              F=    ------------------  =  ----------  =  57.076.
                   Error SS / (n - p)      126.02869

         7.  The p-value is the probability of seeing an F-value
             as large as, or larger than, the observed, if
             the null hypothesis is true.

             The null hypothesis being tested here is that there
             is no real relationship between weight and height.
             That translates into the statement that the regression
             coefficient b (i.e., slope) is zero:

                  H0:  b = 0.

             Here the p-value is < 0.0001.  One would reject the
             null hypothesis and conclude that weight is related
             to height.
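
              As a check (a sketch; the dataset name 'ftest' and its
              variables are made up), the F-statistic and p-value can
              be recomputed from the ANOVA sums of squares:

                 data ftest ;
                      regss = 7193.24912 ; errss = 2142.48772 ;
                      n = 19 ; p = 2 ;
                      f = (regss/(p - 1)) / (errss/(n - p)) ;
                      pvalue = 1 - probf(f, p - 1, n - p) ;
                      put f= pvalue= ;
                 run ;

              Here probf is the cumulative distribution function of
              the F-distribution with (p - 1, n - p) degrees of
              freedom.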

Line 95:  Root MSE is the square root of the Error Mean Square:
          that is, in this case,

               Root MSE = sqrt(126.02869) = 11.22625.

          This is the regression estimate of the standard deviation
          sigma of the "measurement error".  This is often called
          "s".  Another way to write it is as:

               sqrt[(Error SS)/(n - p)].

Line 95:  R-square is the quotient of (Regr SS) / (C Total SS): here,

          R-square = 7193.24912 / 9335.73684 = 0.7705.

          Note that R-square is always less than or equal to 1.
          R-square will equal 1 only when the Error SS is zero, which
          means that all the observations lie exactly on a straight
          line.  In general, it is often said that

             "R-square is the proportion of the variability in
              the outcome variable that is accounted for by the
              regression."

          In this case, about 77%.
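
          Both Root MSE and R-square can be recomputed from the ANOVA
          table (a sketch; the dataset name 'fitstats' and its
          variables are made up):

             data fitstats ;
                  regss = 7193.24912 ; ctotss = 9335.73684 ;
                  errss = ctotss - regss ;
                  n = 19 ; p = 2 ;
                  rootmse = sqrt(errss/(n - p)) ; * estimate of sigma ;
                  rsquare = regss/ctotss ;
                  put rootmse= rsquare= ;
             run ;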

Lines 104-105: Here are shown the parameter estimates, a# = intercept
          and b# = slope, and other statistics.  In this case the
          prediction equation for weight as a function of height is:

               weight = -143.027 + 3.899*height,

          which means that if height increases by 1 inch, you would
          expect weight to increase by 3.899 pounds.

          The Standard Errors of the coefficients are given also.
          These standard errors will be large if there is a lot
          of variability around the regression line (large Error SS)
          and small if the observed data points are very close
          to the regression line (small Error SS).

          T for H0: parameter = 0:  This t-statistic is computed
          as the quotient of the parameter estimate divided by its
          standard error.  In this case, it is compared to a
          t-distribution with n - p = 19 - 2 = 17 degrees of freedom
          (the Error DF).  If the t-statistic is large in absolute
          value, one rejects the null hypothesis.  The null hypothesis
          is that the true parameter is zero.

          Prob > |T|: this is the probability that you would observe
          a t-statistic with absolute value as large as, or larger
          than, the observed value, if in fact the null hypothesis
          were true.  In this case one rejects both of the null
          hypotheses:

               Intercept 'a'  is zero:   p-value = 0.0004.

               Slope 'b' is zero     :   p-value < 0.0001.

          Notice that the latter test is identical to the F-test
          in the ANOVA table.  It is not a coincidence that

          T (slope) = 7.555 = sqrt(F-statistic) = sqrt(57.076).
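
          As a check (a sketch; the names are made up), the t-statistic
          and two-sided p-value for the slope can be recomputed from
          the printed estimate and standard error:

             data ttest ;
                  b = 3.899030 ; seb = 0.51609395 ; df = 17 ;
                  t = b/seb ;
                  pvalue = 2*(1 - probt(abs(t), df)) ;
                  put t= pvalue= ;
             run ;

          Here probt is the cumulative distribution function of the
          t-distribution with df = n - p degrees of freedom, and
          t*t = F.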


Lines 113-144: Plot of residuals versus predictor (height): Note that
          residual = obsd y - pred y.

          It is a fact that the mean of the residuals is always zero.
          Therefore, in residual plots, one expects them to be
          about equally scattered above and below the horizontal line
          through 0 on the vertical axis.

          In this case there is no obvious pattern or outliers among
          the residuals.  An outlier would represent an observation
          which is far off of the regression line.  There is, however,
          a formal definition of outlier, which will be discussed below.

Lines 148-168:  Printout of observations, predicted values, and
          statistics related to influence diagnostics.  Note that all
          of these are on the output dataset 'regout' produced by
          proc reg in lines 27-28.
          Note here that the predicted values for weight (WPRED) are
          computed from the equation given above,

               pred weight = -143.027 + 3.899*height.

          The residuals are the difference between the observed weight
          and the predicted weight.

          The variable 'WSTUDENT' is the 'Studentized residual'.  This
          provides a kind of normalized measure of how far the observed
          weight differs from the predicted weight.  As noted above,
          if this is > 2 in absolute value, it indicates that the
          point is a (perhaps unduly) influential point.  In this
          case the largest absolute value of WSTUDENT is 1.80087.

          The variable WDFFITS is another measure of influence, and
          again, one expects that it is less than 2 in absolute
          value.  The largest absolute value of WDFFITS here
          is 0.74426.

     As noted above, if WSTUDENT or WDFFITS had included some values
that were large in absolute value, there are two actions you would
take: (1) check the data to be sure there are no errors, and (2)
consider re-running the regression with exclusion of the influential
points, and see what kind of difference it makes in the F-statistic,
the R-square, and the parameter estimates.  You can perform this
second regression using the dataset 'regout' as follows:

   proc reg data = regout ;
        model weight = height ;
        where abs(wstudent) le 2.00 and abs(wdffits) le 2.00 ;
   title1 'Regression of weight on height, excluding influential points' ;
   run ;

In this case, the regression would not change, because every
observation satisfies the criteria in the "where" statement, and so
no points would be excluded.

------------------------------------------------------------------------

Data Transformations:

     One may want to transform either the outcome variable (Y) or the
predictor (X).  In general the reasons for performing transformations
are the following:

     1.  To change what appears to be a curvilinear relationship into
         a linear one.

     2.  To attain "homoscedasticity":  "Homoscedasticity" means that
         the measurement error has the same variability regardless of
         the values of the outcome variable and the predictor.  It often
         happens in medical or biological studies that the variability
         of a measured outcome is proportional to the size of that
         outcome.

          An example is FEV1, a measure of lung function, defined as the
          volume of air in liters that a person can blow out in 1 second
         when making a maximal effort.  A large healthy person might
         have an FEV1 of 5.00 liters, while a small, sickly or aged
         person's FEV1 might be 1.5 liters.  The measurement of FEV1
         - unlike, say, weight or height - has a lot of random
          variability.  That is, two consecutive measurements of
          FEV1 for the same person may easily differ by as much as 10%.
          For the large person with FEV1 = 5.00, this would amount
          to saying sigma = .1*5.00 = 0.50 liters.  For the small person
          with FEV1 = 1.50, this yields sigma = .1*1.50 = 0.15 liters.

          ANOVA tests and t-tests like those carried out in proc reg
          are based on the assumption of homoscedasticity.  If
         homoscedasticity is badly violated, these tests may give
         misleading results.

          Frequently investigators try to eliminate heteroscedasticity
          by transforming the outcome variable with a log transformation.
         One might replace FEV1 by log(FEV1), and perform the
         regression with this outcome variable rather than the original.

          A way to detect heteroscedasticity is to plot residuals
          for the outcome variable against the predicted values.
          See below for examples.  Another way is to use options
          in proc reg called SPEC or ACOV on the model statement;
          see the sketch below for an example.
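
          Here is a sketch of both approaches for an FEV1 example
          (the dataset name 'fevdata' and the variable names fev1
          and age are made up for illustration):

             data fevlog ;
                  set fevdata ;
                  logfev1 = log(fev1) ;  * natural log transform ;
             run ;

             proc reg data = fevlog ;
                  model logfev1 = age / spec acov ;
             run ;

          The spec option tests the null hypothesis of
          homoscedasticity; the acov option prints an asymptotic
          covariance matrix for the estimates which remains valid
          under heteroscedasticity.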

========================================================================
Transformations

     The following graph illustrates the problem of heteroscedasticity:

     http://www.biostat.umn.edu/~john-c/5421/n54703.004.2.grf1

     Here the variability in the dependent variable clearly increases
as the dependent variable itself increases.  The plot referenced above
shows an approximately linear relationship between the dependent
variable Y as a function of the independent variable X.  The smooth
curve superimposed on the data was generated by a SAS smoothing procedure
called proc loess.

     The next graph, at

     http://www.biostat.umn.edu/~john-c/5421/n54703.004.2.grf2

shows the same data, but the dependent variable Y has been log-
transformed and replaced by log(Y).  The X-variable is unchanged.

     As you can see, the variability is somewhat "tamed down".
This, however, has come at a price.  The formerly linear relationship
between Y and X does not look so linear when Y is replaced by
log(Y).  Further, some of the observations are far below the
smoothed curve that was fit to the data.  Transformations can help
in some cases, but in this case the log-transformation may have
done more harm than good.

========================================================================

Outliers:

     It is easy to confuse outliers and influential points because
it will often happen that they are the same points.  An influential
point has the following characteristic: if the regression is done
with the influential point omitted, there is a resulting large change
in the regression estimates of the parameters.  Thus in most cases an
influential point will not be close to the regression line determined
by all the other points.  When the influential point is included in the
regression, that regression line changes substantially.  This is what
the influence diagnostic DFFITS measures.

     An outlier also will not be close to the regression line determined
by the other points, but if it is included, it need not have
much effect on the regression line.

     A point can be both an influential point and an outlier.

     The effect of including outliers in a regression is to inflate
the estimate of Root MSE, or "s".  This in turn increases the
standard errors of the parameter estimates, decreases the F-statistic
and the R-square, and decreases the power of the regression procedure 
to detect relationships of the outcome variable to the independent
variables.

     Outliers in regression are defined as follows.  Let S be the
set of residuals from a regression.  Suppose you constructed a
boxplot for the set S.  The distance between the 25th percentile
and the 75th percentile in the boxplot is called the *interquartile
range*, or IQR.  A value which is more than 1.5*IQR above the median
is called an outlier.  Similarly, a value which is more than 1.5*IQR
below the median is called an outlier.

     This leads to the following procedure to detect outliers in 
regression:

     1.  Regress Y on the independent variable(s).  

     2.  Output the residuals to a dataset.  

     3.  Find the 25th percentile, the median, and the 75th
         percentile (using proc univariate) of the residuals.

     4.  Find which points correspond to residuals which are either:

         (i)   above median + 1.5*IQR, or

         (ii)  below median - 1.5*IQR.
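
     Here is a sketch which automates steps 2-4 (the dataset names
'quants' and 'flagged' and the variables med, iqr, and outlier are
made up; the names regout and resfev1 match the Lung Health Study
program shown below):

     proc univariate data = regout noprint ;
          var resfev1 ;
          output out = quants median = med qrange = iqr ;
     run ;

     data flagged ;
          if _n_ = 1 then set quants ;   * attach med and iqr to each obs ;
          set regout ;
          outlier = (resfev1 > med + 1.5*iqr) or
                    (resfev1 < med - 1.5*iqr) ;
     run ;

     proc print data = flagged ;
          where outlier = 1 ;
     run ;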

    Below is a program which shows how this works for data from the
Lung Health Study.  The participants were 161 women, aged 35-60.
Lung function (FEV1) was regressed on age.  There is a plot at:

     http://www.biostat.umn.edu/~john-c/5421/n54703.004.3.grf

which indicates the relationship of FEV1 and age.

========================================================================


options linesize = 80 ;

PROC REG DATA = SMOKE ;
     WHERE NCASE GE 600 AND NCASE LE 1000 AND GENDER EQ 1 ;
     MODEL S2FEVPRE = AGE ;
     OUTPUT OUT = REGOUT R = RESFEV1  P = PDFEV1 DFFITS = DFFFEV1 ;
TITLE1 'REGRESSION OF FEV1 ON AGE - WOMEN ONLY' ;

PROC SORT DATA = REGOUT ; BY RESFEV1 ;

PROC PRINT DATA = REGOUT ;
     VAR NCASE AGE S2FEVPRE PDFEV1 RESFEV1 DFFFEV1 ;
TITLE1 'SOME REGRESSION DIAGNOSTICS ... FEV1 VS AGE' ;

PROC UNIVARIATE DATA = REGOUT PLOT NORMAL ;
     VAR RESFEV1 ;
TITLE1 'PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE' ;

proc loess data = regout ;
     model s2fevpre = age / smooth = .2 ;
     ods output OutputStatistics = Results ;

proc sort data = Results ; by age ;

proc print data = Results ;
     var   age DepVar Pred ;

symbol1 c = black v = dot w = .7 h = .7 ;
symbol2 c = black i = j v = none l = 1 w = 1 ;

GOPTIONS DEVICE = GIF ;

PROC GPLOT DATA = Results ;
     PLOT DepVar * AGE Pred * AGE/ OVERLAY HAXIS = AXIS1 ;
AXIS1 ORDER = 34 TO 62 BY 4 ;
TITLE1 H = 2 'PLOT OF FEV1 VERSUS AGE: WOMEN ONLY' ;
RUN ;

PROC SORT DATA = REGOUT ; BY AGE ;

ENDSAS ;

------------------------------------------------------------------------


                     REGRESSION OF FEV1 ON AGE - WOMEN ONLY                    1
                                                18:57 Saturday, January 31, 2004

                               The REG Procedure
                                 Model: MODEL1
                         Dependent Variable: S2FEVPRE 

                              Analysis of Variance
 
                                     Sum of           Mean
 Source                   DF        Squares         Square    F Value    Pr > F

 Model                     1        3.03831        3.03831      39.34    <.0001
 Error                   159       12.27991        0.07723                     
 Corrected Total         160       15.31822                                    


              Root MSE              0.27791    R-Square     0.1983
              Dependent Mean        2.07478    Adj R-Sq     0.1933
              Coeff Var            13.39450                       


                              Parameter Estimates
 
                                             Parameter      Standard
  Variable    Label                   DF      Estimate         Error   t Value

  Intercept   Intercept                1       3.10145       0.16515     18.78
  age         AGE AT ENTRY INTO LHS    1      -0.02102       0.00335     -6.27

                              Parameter Estimates
 
               Variable    Label                   DF   Pr > |t|

               Intercept   Intercept                1     <.0001
               age         AGE AT ENTRY INTO LHS    1     <.0001
 
              LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 31JAN04 18:57

                   SOME REGRESSION DIAGNOSTICS ... FEV1 VS AGE                  2
                                                18:57 Saturday, January 31, 2004

       Obs    NCASE    age    S2FEVPRE     PDFEV1     RESFEV1     DFFFEV1

         1     734      56      1.30      1.92439    -0.62439    -0.26972
         2     719      50      1.49      2.05050    -0.56050    -0.16407
         3     687      53      1.43      1.98744    -0.55744    -0.19083

------------------------------ observations omitted -------------------

        158     703      59      2.32      1.86133    0.45867    0.24696
        159     810      52      2.48      2.00846    0.47154    0.15055
        160     997      54      2.51      1.96642    0.54358    0.20016
        161     995      43      2.77      2.19763    0.57237    0.22255
 
 
              LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 31JAN04 18:57

           PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE          6
                                                18:57 Saturday, January 31, 2004

                            The UNIVARIATE Procedure
                         Variable:  RESFEV1  (Residual)

                                    Moments

        N                         161    Sum Weights                161
        Mean                        0    Sum Observations             0
        Std Deviation      0.27703684    Variance            0.07674941
        Skewness           -0.2361922    Kurtosis            -0.8066874
        Uncorrected SS     12.2799057    Corrected SS        12.2799057
        Coeff Variation             .    Std Error Mean      0.02183356


                           Basic Statistical Measures
 
                 Location                    Variability

             Mean      0.00000     Std Deviation            0.27704
             Median    0.02561     Variance                 0.07675
             Mode     -0.25356     Range                    1.19675
                                   Interquartile Range      0.42306

     NOTE: The mode displayed is the smallest of 6 modes with a count of 2.


                           Tests for Location: Mu0=0
 
                Test           -Statistic-    -----p Value------

                Student's t    t         0    Pr > |t|    1.0000
                Sign           M       7.5    Pr >= |M|   0.2698
                Signed Rank    S     144.5    Pr >= |S|   0.8082


                              Tests for Normality
 
           Test                  --Statistic---    -----p Value------

           Shapiro-Wilk          W     0.975985    Pr < W      0.0066
           Kolmogorov-Smirnov    D     0.066321    Pr > D      0.0827
           Cramer-von Mises      W-Sq  0.177031    Pr > W-Sq   0.0102
           Anderson-Darling      A-Sq  1.121213    Pr > A-Sq   0.0063


                            Quantiles (Definition 5)
 
                            Quantile        Estimate

                            100% Max       0.5723670
                            99%            0.5435764
                            95%            0.4223670
                            90%            0.3405193
                            75% Q3         0.2105193
                            50% Median     0.0256145
                            25% Q1        -0.2125378
 
              LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 31JAN04 18:57
           PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE          7
                                                18:57 Saturday, January 31, 2004

                            The UNIVARIATE Procedure
                         Variable:  RESFEV1  (Residual)

                            Quantiles (Definition 5)
 
                            Quantile        Estimate

                            10%           -0.4055950
                            5%            -0.4706902
                            1%            -0.5604997
                            0% Min        -0.6243855


                              Extreme Observations
 
                  ------Lowest------        ------Highest-----
 
                      Value      Obs            Value      Obs

                  -0.624385        1         0.443576      157
                  -0.560500        2         0.458672      158
                  -0.557443        3         0.471538      159
                  -0.527633        4         0.543576      160
                  -0.525595        5         0.572367      161


                Stem Leaf                     #             Boxplot
                   5 7                        1                |
                   5 4                        1                |
                   4 67                       2                |
                   4 2223334                  7                |
                   3 67999                    5                |
                   3 0001234                  7                |
                   2 567777799                9                |
                   2 0001113333444           13             +-----+
                   1 7888999                  7             |     |
                   1 012223333344444         15             |     |
                   0 56677888899             11             |     |
                   0 1111222334              10             *--+--*
                  -0 444322211                9             |     |
                  -0 9987765                  7             |     |
                  -1 444211                   6             |     |
                  -1 99666655                 8             |     |
                  -2 3222100                  7             +-----+
                  -2 7765555                  7                |
                  -3 44332                    5                |
                  -3 99776                    5                |
                  -4 4221100                  7                |
                  -4 8877655                  7                |
                  -5 33                       2                |
                  -5 66                       2                |
                  -6 2                        1                |
                     ----+----+----+----+
                 Multiply Stem.Leaf by 10**-1


              LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 31JAN04 18:57
           PROC UNIVARIATE TO DETECT OUTLIERS IN RESIDUALS FEV1 VS AGE          8
                                                18:57 Saturday, January 31, 2004

                            The UNIVARIATE Procedure
                         Variable:  RESFEV1  (Residual)

                                 Normal Probability Plot
             0.575+                                             ++   *
                  |                                            +   *
                  |                                          ++ **
                  |                                        *****
                  |                                      ***
                  |                                    +**
                  |                                  ***
                  |                               ***+
                  |                              **+
                  |                           ***+
                  |                          **+
                  |                        **+
            -0.025+                      ***
                  |                     **+
                  |                    **
                  |                  +**
                  |                +***
                  |               +**
                  |             ++**
                  |           ++**
                  |         +***
                  |       ****
                  |     **
                  |  * *+
            -0.625+* ++
                   +----+----+----+----+----+----+----+----+----+----+
                       -2        -1         0        +1        +2

 
              LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 31JAN04 18:57
========================================================================

     Note that the printout of proc univariate identifies the inter-
quartile range and the median of the residuals: they are

     IQR = .42306   Median = .02561.

     Therefore outliers in the residuals are those with values either:

     above Median + 1.5*IQR  = .6602, or
     below Median - 1.5*IQR  = -.6090.

     From the printout above, as it happens, there is only one
observation which has a residual that would be considered an
outlier:

       Obs    NCASE    age    S2FEVPRE     PDFEV1     RESFEV1     DFFFEV1

         1     734      56      1.30      1.92439    -0.62439    -0.26972

     See if you can locate this observation on the graph below!

==================================================================================

             PLOT OF FEV1 VERSUS AGE: 161 LUNG HEALTH STUDY WOMEN             16
                                                19:46 Saturday, January 31, 2004

            Plot of DepVar*age.  Legend: A = 1 obs, B = 2 obs, etc.

                 |
         3.00000 +
                 |
                 |
                 |                  A
                 |    A     A A
                 |    A     A A A A A A
         2.50000 +      A A       A         A   A     A   A
                 |        A               A           A A A A
       S         |      A   A A   B   A   A   C A   B           A   A
       2         |        A   A A   A A   C A B A B B     A B A
       F         |  A   B             E A   B   B     A   A A   B A
       E         |        A     A                   A B   C A   B C
       V 2.00000 +            A B     B A B     A A         B   A A
       P         |          A         A   A   A A           B C   A
       R         |              A   A       B     A A     B A B B
       E         |            B   A     B A         A   A A   A       A
                 |                  A   A           A B
                 |                                  A   B A         A
         1.50000 +                                A       A       A A
                 |                                      A       B A A
                 |                                            A
                 |
                 |
                 |
         1.00000 +
                 |
                 ---+---------+---------+---------+---------+---------+--
                35.00000  40.00000  45.00000  50.00000  55.00000  60.00000

                                   AGE AT ENTRY INTO LHS



              LUNG HEALTH STUDY :  WBJEC5.SAS (JEC) 31JAN04 19:46

==================================================================================

     The actions that can be taken for outliers in residuals are the following:

     1.  Check the data for errors if possible.

     2.  Re-run the regression excluding the outliers and see whether
         it makes any substantial difference in the F-test or the
         R-square or the T-tests for the coefficients.
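
     For the FEV1 example above, a sketch of step 2, using the
thresholds computed from the median and IQR of the residuals:

     proc reg data = regout ;
          model s2fevpre = age ;
          where resfev1 ge -.6090 and resfev1 le .6602 ;
     title1 'FEV1 vs age, excluding residual outliers' ;
     run ;

In this case only the one outlying observation (NCASE 734) would be
excluded.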

==================================================================================
Problem 1:

      Using the data on mortality and water hardness in Chapter 2 of the
Der-Everitt text, assume the model

      mortality = a + b*hardness + e,

where the error term e has a normal distribution with mean 0 and standard
deviation sigma.

Use SAS and proc reg to analyze this data.

   (1) Plot mortality versus water hardness.
   (2) Compute predicted mortality and residuals, and give a plot of residuals versus hardness.
   (3) Test for influential points and outliers.
   (4) What does the F-test tell you?
   (5) Find 95% confidence intervals for the coefficient estimates.
   (6) What proportion of the variability in mortality is accounted for
       by water hardness?
   (7) What is the regression estimate of sigma ?


n54703.004  Last update: February 7, 2005.