SAS PROC NLIN, Continued SPH 5421 notes.027 An Application to a Threshold Model It happens not infrequently in analyzing datasets that the investigator wants to entertain the possibility that the data fit one model over a certain range of a covariate and another model in another range. More specifically, 'broken-line' models are frequently proposed. In this case, one parametrization of the model could be: _ | | a1 + b1 * X, 0 < X < S, and [1] E(Y) = | | a2 + b2 * X, S < X < T |_ where the coefficients are restricted so that the two lines agree when X = S. Such a model was proposed in a study of arsenic in drinking water, carried out by the Minnesota State Health Department. A graph of the data was as follows: ---------------------------------------------------------------------------------- Line-printer plot of log(arsenic in hair) vs. log(arsenic in water) 5 Men age 18-49 only 21:20 Monday, April 23, 2001 Plot of lahair*lasave. Legend: A = 1 obs, B = 2 obs, etc. | | 1 + | | A | A | A A L | A O | A A G 0 + A ( | A | A A A A r | s | A B e | A A AAAB AA A A n | AAAAA i -1 + A A A c | AB A A A | A A c | AA BA A A A o | A n | A A A c | A AA . -2 + | - | AA | A h | A a | A B i | A A A A A A r -3 + A A ) | | A A A A | | A | | A -4 + | -+-----------+-----------+-----------+-----------+-----------+-----------+- -1.00000 0 1.00000 2.00000 3.00000 4.00000 5.00000 LOG(Arsenic conc. - water) prog: marsamoeba.sas (jec) 23APR01 21:20 input file: mars.data1 ---------------------------------------------------------------------------------- The background for this study was the following. Some private wells, mainly in western Minnesota, were found to have relatively high concentrations of arsenic. Arsenic is a toxic metal, used many years ago as a pesticide (in Paris Green and other preparations). In humans it has a variety of toxic effects, and is classified as a carcinogen. It is often found in conjunction with deposits of other metals such as lead or antimony. It is used to impregnate wood to retard insect damage and rot ('treated wood', which has a greenish color and is used in making outdoor structures such as decks and playground equipment). The reason for its occurrence in private wells is not completely clear. It appears to be due mainly to natural deposits in underlying rock. The concentrations in Minnesota are not as high as those that have been found in some other places in the world, particularly parts of Chile and Taiwan. But they may be high enough to cause some health effects. The Minnesota Department of Health (MDH) decided that a study of arsenic in well-water was warranted. One objective was to see whether there was evidence that people who ingested the well water over a period of time had signs of absorption of arsenic in their bodies. The approach to this question was to take measurements of arsenic in the well water and of arsenic in samples of hair. A variety of wells were tested, some of which had relatively high levels of arsenic and some of which had normal or low levels. The plot above shows the relationship between log(hair arsenic concentration) and log(water arsenic concentration). It seems clear from the graph that some kind of relationship exists, and that it is not linear. There is some reason to think that a kind of threshold model applies: that is, below a certain water concentration level, a person's body can rid itself of arsenic through the kidneys, while above that concentration level (threshold), arsenic begins to accumulate in various parts of the body. Thus a broken line model might apply. [Note, however, that although this might sound like a reasonable explanation, the broken line model was not proposed until after the investigators had looked at a number of graphs.] Plots for women and for other age groups were similar. The complete dataset has 348 observations. There are some missing values. If the expectation of the outcome variable (Y = log(hair arsenic conc.)) is in fact a broken-line function of the predictor (X = log(water arsenic conc.)), then how do you estimate the unknown parameters? The unknowns are the slope and intercept of the left-hand side of the broken line, the slope and intercept of the right-hand side, and the place where the two lines are joined. One way of writing the model is as follows: _ | | a + m1*(X - bend), when X <= bend [2] E(Y) = | | a + m2*(X - bend), when X >= bend |_ Here 'bend' is the X-value of the point where the two lines are joined. The slope of the left-hand line is m1 and that of the right-hand line is m2. The coefficient a represents the y-value at the point where the two lines join. Note that in this parametrization, there are 4 unknowns: a, m1, m2, and bend. Also note that the model is constructed such that, at X = bend, the y-values of the two straight lines agree. A key feature here is that the x-value of the place where the two lines meet (x = bend) is treated as an unknown. Although it looks as though x = 2.0 is a reasonable guess, PROC NLIN will use the observed data to find the maximum likelihood estimate of bend. For brevity, represent the model [2] as: [3] Y = f(X; a, m1, m2, bend) + error, where error is assumed to have a normal distribution with mean 0 and variance sig2, and f refers to the function specified in [2]. For fixed X, the f is linear as a function of a, m1, and m2, but not linear in bend. [Actually at the point X = bend itself, the derivative of f with respect to X will be either discontinuous or undefined.] The sum of squares corresponding to model [3] for the dataset is: SS = sum((Y - f(X; a, m1, m2, bend))2). The derivatives of f with respect to the unknown coefficients are as follows: df/da = 1 _ | | (X - bend), for X < bend df/dm1 = | | 0, for X < bend |_ _ | | 0, for X < bend df/dm2 = | | (X - bend), for X > bend |_ _ | | -m1, for X < bend df/dbend = | | -m2, for X > bend |_ The surprising thing here may be that, for fixed X, the derivatives of f with respect to the four parameters all exist and are continuous. The one exception is df/dbend when X = bend. This turns out not to be a problem, since one may assume that none of the observed datapoints occurs at exactly X = bend. (Technically, the value X = bend has measure 0, which means that the chance that the random variable X would just happen to be there is 0.) What this implies is that one can use PROC NLIN to estimate all of the coefficients. This is done in the following program. The iml Nelder-Mead simplex program was used here also and applied to the same dataset. This latter program required 375 function evaluations and 198 iterations to converge. This is a special case of a 'spline model': a first-order spline with one 'knot' (at x = bend). Cubic splines with variable numbers of knots are commonly used to estimate underlying functional forms in regression problems. The following SAS program (1) reads the data and transforms some of the variables, (2) plots a subset of the data (men aged 18-49) (3) uses proc loess (SAS Ver 8) to draw a smooth curve through the data, (4) uses proc nlin to perform the broken-line regression, (5) plots the broken-line fit to the data. ================================================================================ OPTIONS LINESIZE = 150 ERRORS = 1 MPRINT ; FILENAME GRAPH 'gsas.grf' ; LIBNAME graphs '.' ; GOPTIONS VPOS = 100 HPOS = 100 RESET = GLOBAL ROTATE = LANDSCAPE FTEXT = SWISSB DEVICE = PS GACCESS = SASGASTD GSFNAME = GRAPH GSFMODE = REPLACE GUNIT = PCT BORDER COLORS = (BLACK) HTITLE = 1.5 HTEXT = 1.5 ; * ================================================================== ; DATA MARS1 ; LENGTH AGE_STRA $12 HARPRDCT $3 VITMIN $3 CIGS $3 RESID $8 SEX $6 ; retain oldresid ; infile '/home/walleye/john-c/arsenic/mars.data1' ; input @7 ageyr @15 age_stra 12. @28 asave @36 a_hair @43 a_ncopro @51 a_nuro @60 a_portio @68 bmi @78 harprdct 3. @87 height @91 intakec @99 kilog @105 literc @114 vitmin 3. @122 cigs 3. @129 resid 8. @138 sex 6. @146 welmonth 7. ; lasave = . ; if asave gt 0 then lasave = log(asave) ; lahair = log(a_hair) ; gender = . ; if sex eq 'MALE' then gender = 1 ; if sex eq 'FEMALE' then gender = 2 ; men = 0 ; women = 0 ; if gender eq 1 then men = 1 ; if gender eq 2 then women =1 ; outlier = 0 ; if abs(lasave - .91629) lt .01 then outlier = 1 ; agegroup = . ; if ageyr ge 0 and ageyr lt 18 then agegroup = 1 ; if ageyr ge 18 and ageyr lt 50 then agegroup = 2 ; if ageyr ge 50 then agegroup = 3 ; * =================================================================== ; LABEL LASAVE = LOG(Arsenic conc. - water) ; LABEL LAHAIR = LOG(Arsenic conc. - hair) ; run ; *=============================================================================== ; footnote "prog: marsamoeba.sas (jec) &sysdate &systime input file: mars.data1" ; *=============================================================================== ; options linesize = 80 ; data men1849 ; set mars1 ; where gender eq 1 and agegroup eq 2 and lasave ne . and lahair ne . ; proc loess data = men1849 ; model lahair = lasave / smooth = .5 ; ods output OutputStatistics = Results ; proc sort data = men1849 ; by lasave ; proc sort data = Results ; by lasave ; proc print data = Results ; title1 'Print of output data from proc loess.' ; data Results ; merge Results men1849 ; by lasave; symbol1 v = o c = black w = 2 ; symbol2 i = j l = 1 c = gray w = 2 ; axis1 label = (c = black f = swissb h = 2 'Log(arsenic conc. in water)') ; axis2 label = (c = black f = swissb h = 2 a = 90 'Log(arsenic conc. in hair)') ; proc plot data = Results ; plot lahair * lasave ; title1 'Line-printer plot of log(arsenic in hair) vs. log(arsenic in water)'; title2 'Men age 18-49 only' ; proc gplot data = Results ; plot lahair * lasave Pred * lasave / overlay haxis = axis1 vaxis = axis2 ; title1 h = 2 f = swissb "Plot of observed and loess-predicted log(arsenic in hair)" ; title2 h = 2 f = swissb "versus log(arsenic in water): smoothing parameter = .5" ; title3 h = 2 f = swissb "Men aged 18-49 only." ; run ; proc nlin method = gauss data = men1849 ; parms m1 = 0 m2 = 0 bend = 0 a = 0 ; der.m1 = (lasave - bend) ; if lasave gt bend then der.m1 = 0 ; der.m2 = (lasave - bend) ; if lasave le bend then der.m2 = 0 ; der.bend = -m1 ; if lasave gt bend then der.bend = -m2 ; der.a = 1 ; f = a + m1 * (lasave - bend) ; if lasave gt bend then f = a + m2 * (lasave - bend) ; model lahair = f ; title1 'PROC NLIN to find optimal bend, etc. - bent-line model' ; title2 'Restricted to men aged 18-49 only' ; format agegroup agegroup. ; run ; data Results ; set Results ; a = -2.1940 ; m1 = .3343 ; m2 = .9632 ; bend = 2.2318 ; bentpred = a + m1 * (lasave - bend) ; if lasave gt bend then bentpred = a + m2 * (lasave - bend) ; run ; goptions gsfmode = append ; proc gplot data = Results ; plot lahair * lasave bentpred * lasave / overlay haxis = axis1 vaxis = axis2 ; title1 h = 2 f = swissb "Plot of observed and bent-line-predicted log(arsenic in hair)" ; title2 h = 2 f = swissb "versus log(arsenic in water): smoothing parameter = .5" ; title3 h = 2 f = swissb "Men aged 18-49 only." ; run ; endsas ; ================================================================================ PROC NLIN to find optimal knot, etc. - bent-line model 1 Restricted to men aged 18-49 only 17:56 Wednesday, March 29, 2000 Non-Linear Least Squares Iterative Phase Dependent Variable LAHAIR Method: Gauss-Newton Iter M1 M2 BEND A Sum of Squares 0 0 0 0 0 243.656742 1 -0.035414 0.679495 0 -3.347825 34.590256 2 0.270488 0.686342 0.338609 -3.140992 34.413807 3 0.500980 0.728984 1.528413 -2.453234 34.057450 4 0.545414 0.784473 2.541857 -1.813578 33.569593 5 0.439681 0.866573 2.095891 -2.195154 32.190637 6 0.334301 0.963285 2.296183 -2.151361 31.718857 7 0.334301 0.963285 2.231830 -2.193982 31.688254 8 0.334301 0.963285 2.231830 -2.193982 31.688254 NOTE: Convergence criterion met. Non-Linear Least Squares Summary Statistics Dependent Variable LAHAIR Source DF Sum of Squares Mean Square Regression 4 211.96848830 52.99212207 Residual 74 31.68825408 0.42821965 Uncorrected Total 78 243.65674238 (Corrected Total) 77 89.69401355 Parameter Estimate Asymptotic Asymptotic 95 % Std. Error Confidence Interval Lower Upper M1 0.334301009 0.15956907908 0.0163518013 0.6522502165 M2 0.963284615 0.15486426171 0.6547099867 1.2718592438 BEND 2.231829514 0.54159495595 1.1526750307 3.3109839974 A -2.193981807 0.42579334961 -3.0423959160 -1.3455676971 Asymptotic Correlation Matrix Corr M1 M2 BEND A ---------------------------------------------------------------------------- M1 1 2.284602E-15 0.6528113069 0.7998674596 M2 2.284602E-15 1 0.5630960809 0.2394395728 BEND 0.6528113069 0.5630960809 1 0.9215513257 A 0.7998674596 0.2394395728 0.9215513257 1 prog: marsamoeba.sas (jec) 29MAR00 17:56 input file: mars.data1 ===================================================================== PROBLEM 29 An investigator believes that the level of beta carotene Y in the blood as a function of carotene X in the diet has the following functional form: _ | | c, a constant, if x < T E(Y) = | | c * exp(b * (x - T)), if x >= T |_ Here T is an unknown threshold and c and b are unknown coefficients. Assume errors are independent and normally distributed with mean zero and standard deviation sigma. 1. Given the dataset below, use the simplex program to estimate c, b, and T. Start with initial values: b = .5, c = 4, T = 4. Set itmax = 1500. 2. Using the result from part 1,. make a scatterplot of the original data with the curve for the expected value of the outcome variable overlaid. 3. Use PROC NLIN to estimate c, b, and T. Use METHOD = MARQUARDT. Start with the same initial values as specified in part 1. Are the NLIN answers consistent with the simplex answers? Which one is more likely correct? ======================================================================= OBS X Y 1 0.2 4.1791 2 0.4 5.2783 3 0.6 5.2020 4 0.8 4.9422 5 1.0 4.5159 6 1.2 5.4537 7 1.4 4.5392 8 1.6 4.2084 9 1.8 4.6268 10 2.0 5.7036 11 2.2 4.8654 12 2.4 4.7405 13 2.6 5.0699 14 2.8 4.6849 15 3.0 5.2495 16 3.2 5.4061 17 3.4 5.3747 18 3.6 4.7829 19 3.8 4.0466 20 4.0 4.8501 21 4.2 4.2060 22 4.4 5.4611 23 4.6 5.5197 24 4.8 4.6369 25 5.0 5.2025 26 5.2 5.9291 27 5.4 5.0583 28 5.6 4.8812 29 5.8 4.6772 30 6.0 6.0635 31 6.2 6.5457 32 6.4 7.1562 33 6.6 6.8768 34 6.8 7.6332 35 7.0 7.1068 36 7.2 8.1979 37 7.4 6.5250 38 7.6 8.4432 39 7.8 9.7943 40 8.0 8.4516 41 8.2 10.3283 42 8.4 9.1733 43 8.6 9.4781 44 8.8 12.0051 45 9.0 11.9928 46 9.2 11.8532 47 9.4 11.0441 48 9.6 15.2666 49 9.8 10.7804 50 10.0 16.1887 ======================================================================== MINIMIZING FUNCTIONS IN SPLUS As described in notes.026, the Splus function nlmin can be used to minimize functions. In the following example, nlmin is used to compute the least squares estimate of the coefficients for the model y = b0 + b1 * x + e, where e is a random error term with expectation 0: ================================================================================= # Use of Splus function nlmin ... b <- c(0, 0) x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10) y <- c(1.78345, -0.54128, -0.44503, -0.66441, -1.72841, -3.00969, + -2.28972, -4.39501, -3.14249, -4.44228) f1 <- function(v) { predy <- v[1] + v[2] * x ssq <- sum((y - predy)^2) ssq } m <- nlmin(f1, b) ================================================================================== In this case, the unknown parameters b0 and b1 are stored in a two- dimensional vector, b. The values of the predictor x are: x = 1, 2, ..., 10. The corresponding values of y are y = 1.78345, -0.54128, ..., -4.44228. The function f1 is specified as a function of the dummy variable v. All the real work in this case is done in the line m <- nlmin(f1, b). This is where the function-minimizing routine nlmin is used. The first argument to nlmin must be the name of the function being evaluated, and the second argument is the vector of values that minimize the function. The function itself in this case is the sum of squared differences between the values of y and the predicted values of y, with the latter being defined in the line predy <- v[1] + v[2] * x. Note that both x and predy are in fact vectors of length 10. Assume that the program above is called "minfun.s". In Splus, you run this program by entering: >source("minfun.s") You may then enter >m to obtain the output. The output is comprised of three items: (1) the values of the vector b which minimize the function; (2) a string saying whether convergence occurred (T or F), and (3) the type of convergence obtained. The function-minimizing program above gives the same parameter values that you would get using, for example, proc reg. Note that other arguments to nlmin are possible also: see the Splus reference manual. Note that nlmin, like the simplex method, does not require specification of derivatives. A variant of nlmin called nlminb is also available in Splus. It allows specification of ranges for the parameters. The syntax for nlmin and nlminb are not the same; see the Splus help pages for details. ================================================================================ PROBLEM 29, contin. 4. For the function G(x, y) = x^4 + y^4 - 5*x^2 -7*y^2 + 12, write an Splus program that finds where G(x, y) takes on its minimum value. 5. Use nlmin to estimate c, b, and T for the data on beta carotene given above. ================================================================================ /home/walleye/john-c/5421/notes.027 Last update: April 24, 2001