SAS PROC NLIN, Continued SPH 5421 notes.027
An Application to a Threshold Model
It happens not infrequently in analyzing datasets that the investigator wants
to entertain the possibility that the data fit one model over a certain range of a
covariate and another model in another range. More specifically, 'broken-line'
models are frequently proposed. In this case, one parametrization of the model
could be:
_
|
| a1 + b1 * X, 0 < X < S, and
[1] E(Y) = |
| a2 + b2 * X, S < X < T
|_
where the coefficients are restricted so that the two lines agree when X = S.
Such a model was proposed in a study of arsenic in drinking water, carried out
by the Minnesota State Health Department. A graph of the data was as follows:
----------------------------------------------------------------------------------
Line-printer plot of log(arsenic in hair) vs. log(arsenic in water) 5
Men age 18-49 only 21:20 Monday, April 23, 2001
Plot of lahair*lasave. Legend: A = 1 obs, B = 2 obs, etc.
|
|
1 +
|
| A
| A
| A A
L | A
O | A A
G 0 + A
( |
A | A A A A
r |
s | A B
e | A A AAAB AA A A
n | AAAAA
i -1 + A A A
c | AB A A A
| A A
c | AA BA A A A
o | A
n | A A A
c | A AA
. -2 +
|
- | AA
| A
h | A
a | A B
i | A A A A A A
r -3 + A A
) |
| A A A A
|
| A
|
| A
-4 +
|
-+-----------+-----------+-----------+-----------+-----------+-----------+-
-1.00000 0 1.00000 2.00000 3.00000 4.00000 5.00000
LOG(Arsenic conc. - water)
prog: marsamoeba.sas (jec) 23APR01 21:20 input file: mars.data1
----------------------------------------------------------------------------------
The background for this study was the following. Some private wells, mainly in
western Minnesota, were found to have relatively high concentrations of arsenic.
Arsenic is a toxic metal, used many years ago as a pesticide (in Paris Green and
other preparations). In humans it has a variety of toxic effects, and is classified
as a carcinogen. It is often found in conjunction with deposits of other metals
such as lead or antimony. It is used to impregnate wood to retard insect damage and
rot ('treated wood', which has a greenish color and is used in making outdoor
structures such as decks and playground equipment).
The reason for its occurrence in private wells is not completely clear. It
appears to be due mainly to natural deposits in underlying rock. The concentrations
in Minnesota are not as high as those that have been found in some other places in
the world, particularly parts of Chile and Taiwan. But they may be high enough to
cause some health effects.
The Minnesota Department of Health (MDH) decided that a study of arsenic in
well-water was warranted. One objective was to see whether there was evidence that
people who ingested the well water over a period of time had signs of absorption of
arsenic in their bodies. The approach to this question was to take measurements of
arsenic in the well water and of arsenic in samples of hair. A variety of wells
were tested, some of which had relatively high levels of arsenic and some of which
had normal or low levels.
The plot above shows the relationship between log(hair arsenic concentration)
and log(water arsenic concentration). It seems clear from the graph that some kind
of relationship exists, and that it is not linear. There is some reason to think
that a kind of threshold model applies: that is, below a certain water concentration
level, a person's body can rid itself of arsenic through the kidneys, while above
that concentration level (threshold), arsenic begins to accumulate in various parts
of the body. Thus a broken line model might apply.
[Note, however, that although this might sound like a reasonable explanation,
the broken line model was not proposed until after the investigators had looked at a
number of graphs.]
Plots for women and for other age groups were similar. The complete dataset
has 348 observations. There are some missing values.
If the expectation of the outcome variable (Y = log(hair arsenic conc.)) is in
fact a broken-line function of the predictor (X = log(water arsenic conc.)), then
how do you estimate the unknown parameters? The unknowns are the slope and
intercept of the left-hand side of the broken line, the slope and intercept of the
right-hand side, and the place where the two lines are joined. One way of writing
the model is as follows:
_
|
| a + m1*(X - bend), when X <= bend
[2] E(Y) = |
| a + m2*(X - bend), when X >= bend
|_
Here 'bend' is the X-value of the point where the two lines are joined. The
slope of the left-hand line is m1 and that of the right-hand line is m2. The
coefficient a represents the y-value at the point where the two lines join. Note
that in this parametrization, there are 4 unknowns: a, m1, m2, and bend. Also note
that the model is constructed such that, at X = bend, the y-values of the two
straight lines agree.
A key feature here is that the x-value of the place where the two lines meet (x
= bend) is treated as an unknown. Although it looks as though x = 2.0 is a
reasonable guess, PROC NLIN will use the observed data to find the maximum
likelihood estimate of bend.
For brevity, represent the model [2] as:
[3] Y = f(X; a, m1, m2, bend) + error,
where error is assumed to have a normal distribution with mean 0 and variance sig2,
and f refers to the function specified in [2].
For fixed X, the f is linear as a function of a, m1, and m2, but not linear in
bend. [Actually at the point X = bend itself, the derivative of f with respect to X
will be either discontinuous or undefined.]
The sum of squares corresponding to model [3] for the dataset is:
SS = sum((Y - f(X; a, m1, m2, bend))2).
The derivatives of f with respect to the unknown coefficients are as follows:
df/da = 1
_
|
| (X - bend), for X < bend
df/dm1 = |
| 0, for X < bend
|_
_
|
| 0, for X < bend
df/dm2 = |
| (X - bend), for X > bend
|_
_
|
| -m1, for X < bend
df/dbend = |
| -m2, for X > bend
|_
The surprising thing here may be that, for fixed X, the derivatives of f with
respect to the four parameters all exist and are continuous. The one exception is
df/dbend when X = bend. This turns out not to be a problem, since one may assume
that none of the observed datapoints occurs at exactly X = bend. (Technically, the
value X = bend has measure 0, which means that the chance that the random variable X
would just happen to be there is 0.)
What this implies is that one can use PROC NLIN to estimate all of the
coefficients. This is done in the following program. The iml Nelder-Mead simplex
program was used here also and applied to the same dataset. This latter
program required 375 function evaluations and 198 iterations to converge.
This is a special case of a 'spline model': a first-order spline with one
'knot' (at x = bend). Cubic splines with variable numbers of knots are commonly
used to estimate underlying functional forms in regression problems.
The following SAS program (1) reads the data and transforms some of the
variables, (2) plots a subset of the data (men aged 18-49) (3) uses proc loess (SAS
Ver 8) to draw a smooth curve through the data, (4) uses proc nlin to perform the
broken-line regression, (5) plots the broken-line fit to the data.
================================================================================
OPTIONS LINESIZE = 150 ERRORS = 1 MPRINT ;
FILENAME GRAPH 'gsas.grf' ;
LIBNAME graphs '.' ;
GOPTIONS
VPOS = 100 HPOS = 100
RESET = GLOBAL
ROTATE = LANDSCAPE
FTEXT = SWISSB
DEVICE = PS
GACCESS = SASGASTD
GSFNAME = GRAPH
GSFMODE = REPLACE
GUNIT = PCT BORDER
COLORS = (BLACK)
HTITLE = 1.5 HTEXT = 1.5 ;
* ================================================================== ;
DATA MARS1 ;
LENGTH AGE_STRA $12 HARPRDCT $3 VITMIN $3 CIGS $3 RESID $8
SEX $6 ;
retain oldresid ;
infile '/home/walleye/john-c/arsenic/mars.data1' ;
input @7 ageyr
@15 age_stra 12.
@28 asave
@36 a_hair
@43 a_ncopro
@51 a_nuro
@60 a_portio
@68 bmi
@78 harprdct 3.
@87 height
@91 intakec
@99 kilog
@105 literc
@114 vitmin 3.
@122 cigs 3.
@129 resid 8.
@138 sex 6.
@146 welmonth 7.
;
lasave = . ;
if asave gt 0 then lasave = log(asave) ;
lahair = log(a_hair) ;
gender = . ;
if sex eq 'MALE' then gender = 1 ;
if sex eq 'FEMALE' then gender = 2 ;
men = 0 ; women = 0 ;
if gender eq 1 then men = 1 ;
if gender eq 2 then women =1 ;
outlier = 0 ;
if abs(lasave - .91629) lt .01 then outlier = 1 ;
agegroup = . ;
if ageyr ge 0 and ageyr lt 18 then agegroup = 1 ;
if ageyr ge 18 and ageyr lt 50 then agegroup = 2 ;
if ageyr ge 50 then agegroup = 3 ;
* =================================================================== ;
LABEL LASAVE = LOG(Arsenic conc. - water) ;
LABEL LAHAIR = LOG(Arsenic conc. - hair) ;
run ;
*=============================================================================== ;
footnote "prog: marsamoeba.sas (jec) &sysdate &systime input file: mars.data1" ;
*=============================================================================== ;
options linesize = 80 ;
data men1849 ;
set mars1 ;
where gender eq 1 and agegroup eq 2
and lasave ne . and lahair ne . ;
proc loess data = men1849 ;
model lahair = lasave / smooth = .5 ;
ods output OutputStatistics = Results ;
proc sort data = men1849 ; by lasave ;
proc sort data = Results ; by lasave ;
proc print data = Results ;
title1 'Print of output data from proc loess.' ;
data Results ;
merge Results men1849 ; by lasave;
symbol1 v = o c = black w = 2 ;
symbol2 i = j l = 1 c = gray w = 2 ;
axis1 label = (c = black f = swissb h = 2 'Log(arsenic conc. in water)') ;
axis2 label = (c = black f = swissb h = 2 a = 90
'Log(arsenic conc. in hair)') ;
proc plot data = Results ;
plot lahair * lasave ;
title1
'Line-printer plot of log(arsenic in hair) vs. log(arsenic in water)';
title2 'Men age 18-49 only' ;
proc gplot data = Results ;
plot lahair * lasave
Pred * lasave / overlay haxis = axis1 vaxis = axis2 ;
title1 h = 2 f = swissb
"Plot of observed and loess-predicted log(arsenic in hair)" ;
title2 h = 2 f = swissb
"versus log(arsenic in water): smoothing parameter = .5" ;
title3 h = 2 f = swissb
"Men aged 18-49 only." ;
run ;
proc nlin method = gauss data = men1849 ;
parms m1 = 0
m2 = 0
bend = 0
a = 0 ;
der.m1 = (lasave - bend) ;
if lasave gt bend then der.m1 = 0 ;
der.m2 = (lasave - bend) ;
if lasave le bend then der.m2 = 0 ;
der.bend = -m1 ;
if lasave gt bend then
der.bend = -m2 ;
der.a = 1 ;
f = a + m1 * (lasave - bend) ;
if lasave gt bend then
f = a + m2 * (lasave - bend) ;
model lahair = f ;
title1 'PROC NLIN to find optimal bend, etc. - bent-line model' ;
title2 'Restricted to men aged 18-49 only' ;
format agegroup agegroup. ;
run ;
data Results ;
set Results ;
a = -2.1940 ;
m1 = .3343 ;
m2 = .9632 ;
bend = 2.2318 ;
bentpred = a + m1 * (lasave - bend) ;
if lasave gt bend then bentpred = a + m2 * (lasave - bend) ;
run ;
goptions gsfmode = append ;
proc gplot data = Results ;
plot lahair * lasave
bentpred * lasave / overlay haxis = axis1 vaxis = axis2 ;
title1 h = 2 f = swissb
"Plot of observed and bent-line-predicted log(arsenic in hair)" ;
title2 h = 2 f = swissb
"versus log(arsenic in water): smoothing parameter = .5" ;
title3 h = 2 f = swissb
"Men aged 18-49 only." ;
run ;
endsas ;
================================================================================
PROC NLIN to find optimal knot, etc. - bent-line model 1
Restricted to men aged 18-49 only
17:56 Wednesday, March 29, 2000
Non-Linear Least Squares Iterative Phase
Dependent Variable LAHAIR Method: Gauss-Newton
Iter M1 M2 BEND A Sum of Squares
0 0 0 0 0 243.656742
1 -0.035414 0.679495 0 -3.347825 34.590256
2 0.270488 0.686342 0.338609 -3.140992 34.413807
3 0.500980 0.728984 1.528413 -2.453234 34.057450
4 0.545414 0.784473 2.541857 -1.813578 33.569593
5 0.439681 0.866573 2.095891 -2.195154 32.190637
6 0.334301 0.963285 2.296183 -2.151361 31.718857
7 0.334301 0.963285 2.231830 -2.193982 31.688254
8 0.334301 0.963285 2.231830 -2.193982 31.688254
NOTE: Convergence criterion met.
Non-Linear Least Squares Summary Statistics Dependent Variable LAHAIR
Source DF Sum of Squares Mean Square
Regression 4 211.96848830 52.99212207
Residual 74 31.68825408 0.42821965
Uncorrected Total 78 243.65674238
(Corrected Total) 77 89.69401355
Parameter Estimate Asymptotic Asymptotic 95 %
Std. Error Confidence Interval
Lower Upper
M1 0.334301009 0.15956907908 0.0163518013 0.6522502165
M2 0.963284615 0.15486426171 0.6547099867 1.2718592438
BEND 2.231829514 0.54159495595 1.1526750307 3.3109839974
A -2.193981807 0.42579334961 -3.0423959160 -1.3455676971
Asymptotic Correlation Matrix
Corr M1 M2 BEND A
----------------------------------------------------------------------------
M1 1 2.284602E-15 0.6528113069 0.7998674596
M2 2.284602E-15 1 0.5630960809 0.2394395728
BEND 0.6528113069 0.5630960809 1 0.9215513257
A 0.7998674596 0.2394395728 0.9215513257 1
prog: marsamoeba.sas (jec) 29MAR00 17:56 input file: mars.data1
=====================================================================
PROBLEM 29
An investigator believes that the level of beta carotene Y in the blood as a
function of carotene X in the diet has the following functional form:
_
|
| c, a constant, if x < T
E(Y) = |
| c * exp(b * (x - T)), if x >= T
|_
Here T is an unknown threshold and c and b are unknown coefficients.
Assume errors are independent and normally distributed with mean zero and
standard deviation sigma.
1. Given the dataset below, use the simplex program to estimate
c, b, and T. Start with initial values: b = .5, c = 4, T = 4.
Set itmax = 1500.
2. Using the result from part 1,. make a scatterplot of the original
data with the curve for the expected value of the outcome variable
overlaid.
3. Use PROC NLIN to estimate c, b, and T. Use METHOD = MARQUARDT.
Start with the same initial values as specified in part 1. Are the
NLIN answers consistent with the simplex answers? Which one
is more likely correct?
=======================================================================
OBS X Y
1 0.2 4.1791
2 0.4 5.2783
3 0.6 5.2020
4 0.8 4.9422
5 1.0 4.5159
6 1.2 5.4537
7 1.4 4.5392
8 1.6 4.2084
9 1.8 4.6268
10 2.0 5.7036
11 2.2 4.8654
12 2.4 4.7405
13 2.6 5.0699
14 2.8 4.6849
15 3.0 5.2495
16 3.2 5.4061
17 3.4 5.3747
18 3.6 4.7829
19 3.8 4.0466
20 4.0 4.8501
21 4.2 4.2060
22 4.4 5.4611
23 4.6 5.5197
24 4.8 4.6369
25 5.0 5.2025
26 5.2 5.9291
27 5.4 5.0583
28 5.6 4.8812
29 5.8 4.6772
30 6.0 6.0635
31 6.2 6.5457
32 6.4 7.1562
33 6.6 6.8768
34 6.8 7.6332
35 7.0 7.1068
36 7.2 8.1979
37 7.4 6.5250
38 7.6 8.4432
39 7.8 9.7943
40 8.0 8.4516
41 8.2 10.3283
42 8.4 9.1733
43 8.6 9.4781
44 8.8 12.0051
45 9.0 11.9928
46 9.2 11.8532
47 9.4 11.0441
48 9.6 15.2666
49 9.8 10.7804
50 10.0 16.1887
========================================================================
MINIMIZING FUNCTIONS IN SPLUS
As described in notes.026, the Splus function nlmin can be used to minimize
functions. In the following example, nlmin is used to compute the least squares
estimate of the coefficients for the model
y = b0 + b1 * x + e,
where e is a random error term with expectation 0:
=================================================================================
# Use of Splus function nlmin ...
b <- c(0, 0)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(1.78345, -0.54128, -0.44503, -0.66441, -1.72841, -3.00969,
+ -2.28972, -4.39501, -3.14249, -4.44228)
f1 <- function(v)
{
predy <- v[1] + v[2] * x
ssq <- sum((y - predy)^2)
ssq
}
m <- nlmin(f1, b)
==================================================================================
In this case, the unknown parameters b0 and b1 are stored in a two-
dimensional vector, b. The values of the predictor x are:
x = 1, 2, ..., 10.
The corresponding values of y are
y = 1.78345, -0.54128, ..., -4.44228.
The function f1 is specified as a function of the dummy variable v.
All the real work in this case is done in the line
m <- nlmin(f1, b).
This is where the function-minimizing routine nlmin is used. The first
argument to nlmin must be the name of the function being evaluated, and the
second argument is the vector of values that minimize the function. The function
itself in this case is the sum of squared differences between the values of
y and the predicted values of y, with the latter being defined in the line
predy <- v[1] + v[2] * x.
Note that both x and predy are in fact vectors of length 10.
Assume that the program above is called "minfun.s". In Splus, you run this
program by entering:
>source("minfun.s")
You may then enter
>m
to obtain the output. The output is comprised of three items: (1) the values
of the vector b which minimize the function; (2) a string saying whether
convergence occurred (T or F), and (3) the type of convergence obtained.
The function-minimizing program above gives the same parameter values that
you would get using, for example, proc reg.
Note that other arguments to nlmin are possible also: see the Splus
reference manual.
Note that nlmin, like the simplex method, does not require specification
of derivatives.
A variant of nlmin called nlminb is also available in Splus. It allows
specification of ranges for the parameters. The syntax for nlmin and nlminb
are not the same; see the Splus help pages for details.
================================================================================
PROBLEM 29, contin.
4. For the function G(x, y) = x^4 + y^4 - 5*x^2 -7*y^2 + 12, write
an Splus program that finds where G(x, y) takes on its minimum value.
5. Use nlmin to estimate c, b, and T for the data on beta carotene given above.
================================================================================
/home/walleye/john-c/5421/notes.027 Last update: April 24, 2001