SAS PROC NLIN, Continued                                       SPH 5421 notes.027

An Application to a Threshold Model

     It happens not infrequently in analyzing datasets that the investigator wants
to entertain the possibility that the data fit one model over a certain range of a
covariate and another model in another range.  More specifically, 'broken-line'
models are frequently proposed.  In this case, one parametrization of the model
could be:
                     _
                    |
                    |  a1 + b1 * X,  0 < X < S,  and
[1]         E(Y) =  |
                    |  a2 + b2 * X,  S < X < T
                    |_

where the coefficients are restricted so that the two lines agree when X = S.

     Such a model was proposed in a study of arsenic in drinking water, carried out
by the Minnesota State Health Department.  A graph of the data was as follows:







































----------------------------------------------------------------------------------
       Line-printer plot of log(arsenic in hair) vs. log(arsenic in water)      5
                               Men age 18-49 only   21:20 Monday, April 23, 2001

           Plot of lahair*lasave.  Legend: A = 1 obs, B = 2 obs, etc.

     |
     |
   1 +
     |
     |                                                                 A
     |                                                                  A
     |                                                             A        A
L    |                       A
O    |                                                       A          A
G  0 +                                                            A
(    |
A    |                                                 A    A     A   A
r    |
s    |                                                      A     B
e    |                                           A   A AAAB  AA   A  A
n    |                                                    AAAAA
i -1 +                                                 A   A    A
c    |                                                 AB      A  A      A
     |                                                A A
c    |                                               AA   BA A A  A
o    |                              A
n    |                               A                A A
c    |                                         A     AA
. -2 +
     |
-    |                                   AA
     |                                          A
h    |                                   A
a    |                                             A  B
i    |      A    A    A     A    A                A
r -3 +    A                               A
)    |
     |             A    A   A             A
     |
     |                  A
     |
     | A
  -4 +
     |
     -+-----------+-----------+-----------+-----------+-----------+-----------+-
  -1.00000        0        1.00000     2.00000     3.00000     4.00000   5.00000

                              LOG(Arsenic conc. - water)

        prog: marsamoeba.sas (jec) 23APR01 21:20 input file: mars.data1
----------------------------------------------------------------------------------

     The background for this study was the following.  Some private wells, mainly in
western Minnesota, were found to have relatively high concentrations of arsenic.  
Arsenic is a toxic metal, used many years ago as a pesticide (in Paris Green and
other preparations).  In humans it has a variety of toxic effects, and is classified
as a carcinogen.  It is often found in conjunction with deposits of other metals
such as lead or antimony.  It is used to impregnate wood to retard insect damage and
rot ('treated wood', which has a greenish color and is used in making outdoor
structures such as decks and playground equipment).

     The reason for its occurrence in private wells is not completely clear. It
appears to be due mainly to natural deposits in underlying rock.  The concentrations
in Minnesota are not as high as those that have been found in some other places in
the world, particularly parts of Chile and Taiwan.  But they may be high enough to
cause some health effects.

     The Minnesota Department of Health (MDH) decided that a study of arsenic in
well-water was warranted.  One objective was to see whether there was evidence that
people who ingested the well water over a period of time had signs of absorption of
arsenic in their bodies.  The approach to this question was to take measurements of
arsenic in the well water and of arsenic in samples of hair.  A variety of wells
were tested, some of which had relatively high levels of arsenic and some of which
had normal or low levels.

     The plot above shows the relationship between log(hair arsenic concentration)
and log(water arsenic concentration).  It seems clear from the graph that some kind
of relationship exists, and that it is not linear.  There is some reason to think
that a kind of threshold model applies: that is, below a certain water concentration
level, a person's body can rid itself of arsenic through the kidneys, while above
that concentration level (threshold), arsenic begins to accumulate in various parts
of the body.  Thus a broken line model might apply.

     [Note, however, that although this might sound like a reasonable explanation,
the broken line model was not proposed until after the investigators had looked at a
number of graphs.]

     Plots for women and for other age groups were similar.  The complete dataset
has 348 observations.  There are some missing values.

     If the expectation of the outcome variable (Y = log(hair arsenic conc.)) is in
fact a broken-line function of the predictor (X = log(water arsenic conc.)), then
how do you estimate the unknown parameters?  The unknowns are the slope and
intercept of the left-hand side of the broken line, the slope and intercept of the
right-hand side, and the place where the two lines are joined. One way of writing
the model is as follows:
                 _
                |
                | a + m1*(X - bend), when X <= bend
[2]      E(Y) = |
                | a + m2*(X - bend), when X >= bend
                |_

     Here 'bend' is the X-value of the point where the two lines are joined. The
slope of the left-hand line is m1 and that of the right-hand line is m2. The
coefficient a represents the y-value at the point where the two lines join.  Note
that in this parametrization, there are 4 unknowns:  a, m1, m2, and bend.  Also note
that the model is constructed such that, at X = bend, the y-values of the two
straight lines agree.

     A key feature here is that the x-value of the place where the two lines meet (x
= bend) is treated as an unknown.  Although it looks as though x = 2.0 is a
reasonable guess, PROC NLIN will use the observed data to find the maximum
likelihood estimate of bend.

     For brevity, represent the model [2] as:

[3]        Y = f(X; a, m1, m2, bend) + error,

where error is assumed to have a normal distribution with mean 0 and variance sig2,
and f refers to the function specified in [2].

     For fixed X, the f is linear as a function of a, m1, and m2, but not linear in
bend.  [Actually at the point X = bend itself, the derivative of f with respect to X
will be either discontinuous or undefined.]

     The sum of squares corresponding to model [3] for the dataset is:


           SS = sum((Y - f(X; a, m1, m2, bend))²).

     The derivatives of f with respect to the unknown coefficients are as follows:

           df/da = 1
                        _
                       |
                       | (X - bend), for X < bend
           df/dm1 =    |
                       |  0, for X < bend
                       |_
                        _
                       |
                       |  0, for X < bend
           df/dm2   =  |
                       | (X - bend), for X > bend
                       |_
                        _
                       |
                       | -m1, for X < bend
           df/dbend =  |
                       | -m2, for X > bend
                       |_

     The surprising thing here may be that, for fixed X, the derivatives of f with
respect to the four parameters all exist and are continuous.  The one exception is
df/dbend when X = bend.  This turns out not to be a problem, since one may assume
that none of the observed datapoints occurs at exactly X = bend. (Technically, the
value X = bend has measure 0, which means that the chance that the random variable X
would just happen to be there is 0.)

     What this implies is that one can use PROC NLIN to estimate all of the
coefficients.  This is done in the following program.  The iml Nelder-Mead simplex
program was used here also and applied to the same dataset.  This latter
program required 375 function evaluations and 198 iterations to converge.

     This is a special case of a 'spline model': a first-order spline with one
'knot' (at x = bend).  Cubic splines with variable numbers of knots are commonly
used to estimate underlying functional forms in regression problems.


     The following SAS program (1) reads the data and transforms some of the
variables, (2) plots a subset of the data (men aged 18-49) (3) uses proc loess (SAS
Ver 8) to draw a smooth curve through the data, (4) uses proc nlin to perform the
broken-line regression, (5) plots the broken-line fit to the data.

================================================================================
OPTIONS  LINESIZE = 150 ERRORS = 1  MPRINT ;
FILENAME GRAPH 'gsas.grf' ;
LIBNAME graphs '.' ;

GOPTIONS
         VPOS = 100 HPOS = 100
         RESET = GLOBAL
         ROTATE = LANDSCAPE
         FTEXT = SWISSB
         DEVICE = PS
         GACCESS = SASGASTD
         GSFNAME = GRAPH
         GSFMODE = REPLACE
         GUNIT = PCT BORDER
         COLORS = (BLACK)
         HTITLE = 1.5 HTEXT = 1.5 ;

* ================================================================== ;

 DATA MARS1 ;
      LENGTH AGE_STRA $12 HARPRDCT $3 VITMIN $3 CIGS $3 RESID $8
             SEX $6 ;
      retain oldresid ;
      infile '/home/walleye/john-c/arsenic/mars.data1' ;
      input   @7  ageyr
             @15  age_stra 12.
             @28  asave
             @36  a_hair
             @43  a_ncopro
             @51  a_nuro
             @60  a_portio
             @68  bmi
             @78  harprdct 3.
             @87  height
             @91  intakec
             @99  kilog
            @105  literc
            @114  vitmin   3.
            @122  cigs     3.
            @129  resid    8.
            @138  sex      6.
            @146  welmonth 7.
            ;

 lasave = . ;
 if asave gt 0 then lasave = log(asave) ;


 lahair   = log(a_hair) ;
 gender   = . ;
 if sex eq 'MALE' then gender = 1 ;
 if sex eq 'FEMALE' then gender = 2 ;
 men = 0 ; women = 0 ;
 if gender eq 1 then men = 1 ;
 if gender eq 2 then women =1 ;

 outlier = 0 ;
 if abs(lasave - .91629) lt .01 then outlier = 1 ;

 agegroup = . ;
 if ageyr ge  0 and ageyr lt 18 then agegroup = 1 ;
 if ageyr ge 18 and ageyr lt 50 then agegroup = 2 ;
 if ageyr ge 50                 then agegroup = 3 ;


* =================================================================== ;

LABEL  LASAVE   = LOG(Arsenic conc. - water)     ;
LABEL  LAHAIR   = LOG(Arsenic conc. - hair)     ;

run ;

*=============================================================================== ;

   footnote "prog: marsamoeba.sas (jec) &sysdate &systime input file: mars.data1" ;

*=============================================================================== ;
options linesize = 80 ;

data men1849 ;
     set mars1 ;
     where gender eq 1 and agegroup eq 2
       and lasave ne . and lahair ne . ;

proc loess data = men1849 ;
     model lahair = lasave / smooth = .5 ;
     ods output OutputStatistics = Results ;

proc sort data = men1849 ; by lasave ;
proc sort data = Results ; by lasave ;

proc print data = Results ;
title1 'Print of output data from proc loess.' ;

data Results ;
     merge Results men1849 ; by lasave;

symbol1 v = o c = black w = 2 ;
symbol2 i = j l = 1 c = gray w = 2 ;
axis1 label = (c = black f = swissb h = 2 'Log(arsenic conc. in water)') ;
axis2 label = (c = black f = swissb h = 2 a = 90
               'Log(arsenic conc. in hair)') ;

proc plot data = Results ;
     plot lahair * lasave ;
title1
 'Line-printer plot of log(arsenic in hair) vs. log(arsenic in water)';
title2 'Men age 18-49 only' ;

proc gplot data = Results ;
     plot lahair * lasave
          Pred   * lasave / overlay  haxis = axis1 vaxis = axis2 ;
title1 h = 2 f = swissb
 "Plot of observed and loess-predicted log(arsenic in hair)" ;
title2 h = 2 f = swissb
 "versus log(arsenic in water): smoothing parameter = .5" ;
title3 h = 2 f = swissb
 "Men aged 18-49 only." ;
run ;

proc nlin method = gauss data = men1849 ;
     parms m1 = 0
           m2 = 0
           bend = 0
           a  =  0  ;

      der.m1 = (lasave - bend) ;
      if lasave gt bend then der.m1 = 0 ;

      der.m2 = (lasave - bend) ;
      if lasave le bend then der.m2 = 0 ;

      der.bend = -m1 ;
      if lasave gt bend then
      der.bend = -m2 ;

      der.a = 1 ;

     f = a + m1 * (lasave - bend) ;

     if lasave gt bend then
     f = a + m2 * (lasave - bend) ;

     model lahair = f ;

title1 'PROC NLIN to find optimal bend, etc. - bent-line model' ;
title2 'Restricted to men aged 18-49 only' ;
format agegroup agegroup. ;
run ;

data Results ;
     set Results ;
     a    = -2.1940 ;
     m1   =   .3343 ;
     m2   =   .9632 ;
     bend =  2.2318 ;

     bentpred = a + m1 * (lasave - bend) ;
     if lasave gt bend then bentpred = a + m2 * (lasave - bend) ;

run ;

goptions gsfmode = append ;

proc gplot data = Results ;
     plot lahair * lasave
          bentpred * lasave / overlay  haxis = axis1 vaxis = axis2 ;
title1 h = 2 f = swissb
 "Plot of observed and bent-line-predicted log(arsenic in hair)" ;
title2 h = 2 f = swissb
 "versus log(arsenic in water): smoothing parameter = .5" ;
title3 h = 2 f = swissb
 "Men aged 18-49 only." ;
run ;

endsas ;

================================================================================

             PROC NLIN to find optimal knot, etc. - bent-line model            1
                       Restricted to men aged 18-49 only
                                                 17:56 Wednesday, March 29, 2000

                    Non-Linear Least Squares Iterative Phase
                Dependent Variable LAHAIR   Method: Gauss-Newton
Iter       M1             M2             BEND           A        Sum of Squares
   0              0              0              0              0     243.656742
   1      -0.035414       0.679495              0      -3.347825      34.590256
   2       0.270488       0.686342       0.338609      -3.140992      34.413807
   3       0.500980       0.728984       1.528413      -2.453234      34.057450
   4       0.545414       0.784473       2.541857      -1.813578      33.569593
   5       0.439681       0.866573       2.095891      -2.195154      32.190637
   6       0.334301       0.963285       2.296183      -2.151361      31.718857
   7       0.334301       0.963285       2.231830      -2.193982      31.688254
   8       0.334301       0.963285       2.231830      -2.193982      31.688254
NOTE: Convergence criterion met.


    Non-Linear Least Squares Summary Statistics     Dependent Variable LAHAIR

        Source                DF Sum of Squares     Mean Square

        Regression             4   211.96848830     52.99212207
        Residual              74    31.68825408      0.42821965
        Uncorrected Total     78   243.65674238

        (Corrected Total)     77    89.69401355


        Parameter     Estimate    Asymptotic             Asymptotic 95 %
                                  Std. Error         Confidence Interval
                                                     Lower         Upper
        M1         0.334301009 0.15956907908  0.0163518013  0.6522502165
        M2         0.963284615 0.15486426171  0.6547099867  1.2718592438
        BEND       2.231829514 0.54159495595  1.1526750307  3.3109839974
        A         -2.193981807 0.42579334961 -3.0423959160 -1.3455676971


                         Asymptotic Correlation Matrix

  Corr                M1                M2              BEND                 A
  ----------------------------------------------------------------------------
  M1                   1      2.284602E-15      0.6528113069      0.7998674596
  M2        2.284602E-15                 1      0.5630960809      0.2394395728
  BEND      0.6528113069      0.5630960809                 1      0.9215513257
  A         0.7998674596      0.2394395728      0.9215513257                 1
 
 
        prog: marsamoeba.sas (jec) 29MAR00 17:56 input file: mars.data1

=====================================================================

PROBLEM 29

     An investigator believes that the level of beta carotene Y in the blood as a
function of carotene X in the diet has the following functional form:
                   _
                  |
                  |  c, a constant, if x < T
          E(Y) =  |
                  |  c * exp(b * (x - T)),    if x >= T
                  |_

     Here T is an unknown threshold and c and b are unknown coefficients.

     Assume errors are independent and normally distributed with mean zero and
standard deviation sigma.


1.   Given the dataset below, use the simplex program to estimate
     c, b, and T.  Start with initial values: b = .5, c = 4, T = 4.
     Set itmax = 1500.

2.   Using the result from part 1,. make a scatterplot of the original
     data with the curve for the expected value of the outcome variable
     overlaid.

3.   Use PROC NLIN to estimate c, b, and T.  Use METHOD = MARQUARDT.
     Start with the same initial values as specified in part 1.  Are the
     NLIN answers consistent with the simplex answers?  Which one
     is more likely correct?

=======================================================================
                             
                       OBS       X       Y
                             
                         1     0.2     4.1791
                         2     0.4     5.2783
                         3     0.6     5.2020
                         4     0.8     4.9422
                         5     1.0     4.5159
                         6     1.2     5.4537
                         7     1.4     4.5392
                         8     1.6     4.2084
                         9     1.8     4.6268
                        10     2.0     5.7036
                        11     2.2     4.8654
                        12     2.4     4.7405
                        13     2.6     5.0699
                        14     2.8     4.6849
                        15     3.0     5.2495
                        16     3.2     5.4061
                        17     3.4     5.3747
                        18     3.6     4.7829
                        19     3.8     4.0466
                        20     4.0     4.8501
                        21     4.2     4.2060
                        22     4.4     5.4611
                        23     4.6     5.5197
                        24     4.8     4.6369
                        25     5.0     5.2025
                        26     5.2     5.9291
                        27     5.4     5.0583
                        28     5.6     4.8812
                        29     5.8     4.6772
                        30     6.0     6.0635
                        31     6.2     6.5457
                        32     6.4     7.1562
                        33     6.6     6.8768
                        34     6.8     7.6332
                        35     7.0     7.1068
                        36     7.2     8.1979
                        37     7.4     6.5250
                        38     7.6     8.4432
                        39     7.8     9.7943
                        40     8.0     8.4516
                        41     8.2    10.3283
                        42     8.4     9.1733
                        43     8.6     9.4781
                        44     8.8    12.0051
                        45     9.0    11.9928
                        46     9.2    11.8532
                        47     9.4    11.0441
                        48     9.6    15.2666
                        49     9.8    10.7804
                        50    10.0    16.1887
                                                                     


========================================================================

MINIMIZING FUNCTIONS IN SPLUS

     As described in notes.026, the Splus function nlmin can be used to minimize
functions.  In the following example, nlmin is used to compute the least squares
estimate of the coefficients for the model

                  y = b0 + b1 * x + e,

where e is a random error term with expectation 0:

=================================================================================
# Use of Splus function nlmin ...

b <- c(0, 0)

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

y <- c(1.78345, -0.54128, -0.44503, -0.66441, -1.72841, -3.00969,
+     -2.28972, -4.39501, -3.14249, -4.44228)

f1 <- function(v)
{
  predy <- v[1] + v[2] * x
  ssq <- sum((y - predy)^2)
  ssq
}

m <- nlmin(f1, b)

==================================================================================

     In this case, the unknown parameters b0 and b1 are stored in a two-
dimensional vector, b.  The values of the predictor x are:

     x = 1, 2, ..., 10.

The corresponding values of y are

     y = 1.78345, -0.54128, ..., -4.44228.

The function f1 is specified as a function of the dummy variable v.

     All the real work in this case is done in the line

     m <- nlmin(f1, b).

     This is where the function-minimizing routine nlmin is used.  The first
argument to nlmin must be the name of the function being evaluated, and the
second argument is the vector of values that minimize the function.  The function
itself in this case is the sum of squared differences between the values of
y and the predicted values of y, with the latter being defined in the line

     predy <- v[1] + v[2] * x.

Note that both x and predy are in fact vectors of length 10.

     Assume that the program above is called "minfun.s".  In Splus, you run this
program by entering:

     >source("minfun.s")

You may then enter

     >m

to obtain the output.  The output is comprised of three items: (1) the values
of the vector b which minimize the function; (2) a string saying whether
convergence occurred (T or F), and (3) the type of convergence obtained.

     The function-minimizing program above gives the same parameter values that
you would get using, for example, proc reg.

     Note that other arguments to nlmin are possible also: see the Splus
reference manual.

     Note that nlmin, like the simplex method, does not require specification
of derivatives.

     A variant of nlmin called nlminb is also available in Splus.  It allows
specification of ranges for the parameters.  The syntax for nlmin and nlminb
are not the same; see the Splus help pages for details.

================================================================================
PROBLEM 29, contin.

  4.  For the function G(x, y) =  x^4 + y^4 - 5*x^2 -7*y^2 + 12, write
      an Splus program that finds where G(x, y) takes on its minimum value.

  5.  Use nlmin to estimate c, b, and T for the data on beta carotene given above.

================================================================================




/home/walleye/john-c/5421/notes.027   Last update: April 24, 2001