COMPUTATION OF SOME SUMMARY STATISTICS            SPH 5421 notes.008

     Most statistical tests and estimates depend on summary statistics,
which reduce a dataset of perhaps several hundred observations to a
small number of values that are arithmetic combinations of the raw
data.  The t-statistic, for example, requires computation of the mean
and variance of N numbers.  The mean and variance in turn require
computation of the sum of the N numbers and the sum of their squares.

     Most SAS procedures, like PROC TTEST, contain algorithms which
accumulate summary values as they pass through the dataset.  After the
last observation has been read, the procedure combines the accumulated
values into the desired summary statistics.

     In many cases it is not difficult to reconstruct such
calculations.  Say, for example, that you want to do your own
computation of the mean and variance for a variable in a given
dataset.  Consider the following SAS code (line numbers have been
added to facilitate later discussion):

01  data examp ;
02       retain hcount hsum hsumsq 0 ;
03
04       vexamp = 1 ;
05
06       do i = 1 to 100 ;
07
08          height = 65 + 6 * rannor(-1) ;
09          if height ne . then do ;
10
11             hcount = hcount + 1 ;
12             hsum   = hsum + height ;
13             hsumsq = hsumsq + height * height ;
14             output ;
15
16          end ;
17
18       end ;
19
20  run ;
21
22  data sums ;
23       set examp ;
24       by vexamp ;
25
26       if last.vexamp eq 1 and hcount ge 1 then do ;
27
28          avehgt = hsum / hcount ;
29
30          if hcount ge 2 then do ;
31
32             varhgt = (hsumsq - hsum * hsum / hcount) / (hcount - 1) ;
33             sdhgt  = sqrt(varhgt) ;
34             sehgt  = sdhgt / sqrt(hcount) ;
35             output ;
36
37          end ;
38
39       end ;
40
41  run ;
42
43  proc print data = sums ;
44       var hcount hsum hsumsq avehgt varhgt sdhgt sehgt ;
45  title 'Computation of mean, variance, etc. within data steps' ;
46
47  proc means data = examp n mean var stddev stderr ;
48       var height ;
49  title 'Computation of mean, variance, etc. by PROC MEANS' ;
50
51  endsas ;

     The above program computes the mean, variance, standard
deviation, and standard error of the mean for a randomly generated set
of heights.  There are several key things to note in the program:

1.  Line 02:  At the start of each iteration of a data step, SAS by
    default resets computed variables to missing.  The RETAIN
    statement causes such variables to keep the value that they had at
    the previous observation.  In line 02, the variables hcount, hsum,
    and hsumsq are declared as RETAIN variables, and they are
    initialized to 0.  In the first data step here the DO loop runs
    within a single iteration, so the essential effect of line 02 is
    that the three accumulators start at 0 rather than at missing.

2.  Line 04:  The variable VEXAMP is used later in an obscure way to
    tell SAS which is the last valid observation on the file.  See
    note 4 below for a more complete explanation.

3.  Lines 11-13:  The accumulating of the summary variables for
    counts, sums, and sums of squares of heights is done in these
    lines.  Note the OUTPUT statement on line 14 also.

4.  Lines 24-26:  At this point I wanted SAS to process only the last
    valid observation in the dataset.  In this case I knew there were
    exactly 100 observations in the dataset examp, but I wanted to
    write the code as if the number of valid observations were not
    known (as if the dataset had been read in from a file with unknown
    characteristics).  SAS has special variables to detect the last
    observation corresponding to each value of a BY variable.  The
    variable last.vexamp is set to 0 for all observations which are
    not the last for a given value of vexamp, and to 1 for any
    observation which is the last.  So the IF statement on line 26 has
    the effect of excluding all observations in the dataset examp
    except the last one, which has in it the completed and useful
    summary data on counts, sums, and sums of squares.  The variable
    vexamp is defined to take on only one value (1).  To some extent I
    am tricking SAS here into telling me which is the last observation
    on the dataset.  If there is a more straightforward way of doing
    this I would like to know about it.

5.  Lines 28-34:  Here is where the computation of the desired
    statistics from the counts, sums, and sums of squares takes place.

6.  Line 35:  The OUTPUT statement here is executed only once.  This
    means that the dataset 'sums' has in it only one observation.  In
    fact, the structure of the code here is such that if hcount < 2,
    the dataset 'sums' is empty.

     Appended is the output from this program.  Note that the data
step computations agree with the PROC MEANS output.

=======================================================================
       Computation of mean, variance, etc. within data steps         1
                                    19:51 Tuesday, January 25, 2000

OBS  HCOUNT     HSUM     HSUMSQ   AVEHGT   VARHGT    SDHGT    SEHGT

  1     100  6547.12  431493.36  65.4712  28.7437  5.36132  0.53613

------------------------------------------------------------------------
         Computation of mean, variance, etc. by PROC MEANS           2
                                    19:51 Tuesday, January 25, 2000

Analysis Variable : HEIGHT

  N          Mean      Variance       Std Dev     Std Error
-----------------------------------------------------------
100    65.4711943    28.7437420     5.3613191     0.5361319
-----------------------------------------------------------
=======================================================================

PROBLEM 10

1.  Write SAS program code which (1) reads in a datafile, and (2)
    computes the t-statistic for testing whether the mean of a
    specified variable X is equal to some specified value X0.  Your
    program should behave correctly if missing values are encountered.

2.  Appended is a SAS program which reads in a datafile called
    'lhs.listing'.  Embed your t-statistic code in this program and
    use it to test the null hypothesis that the mean value of f10cigs
    on the file is 30.  Check your answer by an appropriate use of
    PROC MEANS or PROC TTEST.

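
     Note 4 above asks whether there is a more direct way to find the
last observation in a dataset.  One possibility, sketched here and not
part of the original program, is the END= option of the SET statement.
It names a temporary variable which SAS sets to 1 while the last
observation of the input dataset is being processed, and to 0
otherwise.  The names sums2 and lastobs below are made up for
illustration; examp and its variables are those of the first example.

```sas
/* Sketch only: same computation as lines 22-41 of the first example, */
/* but using END= instead of the BY / last.vexamp device.             */
/* 'sums2' and 'lastobs' are hypothetical names.                      */

data sums2 ;
     set examp end = lastobs ;   /* lastobs = 1 on the final observation */

     if lastobs and hcount ge 2 then do ;

        avehgt = hsum / hcount ;
        varhgt = (hsumsq - hsum * hsum / hcount) / (hcount - 1) ;
        sdhgt  = sqrt(varhgt) ;
        sehgt  = sdhgt / sqrt(hcount) ;
        output ;                 /* executed at most once */

     end ;

run ;

proc print data = sums2 ;
     var hcount hsum hsumsq avehgt varhgt sdhgt sehgt ;
run ;
```

     With this version neither vexamp nor the BY statement is needed
in the second data step, and lastobs itself is not written to the
output dataset.  The same device may be convenient in Problem 10,
where the number of records on the file is not known in advance.
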
=======================================================================

data lhslist ;
     infile '/home/gnome/john-c/lhs.listing' ;
     input ncase agroup age f10cigs drinks hgtcm wgtkg bmi dthcause
           av1cigs av5cigs ;
run ;

proc means ;

=======================================================================

ANOTHER EXAMPLE: THE MINIMUM OF THE SUM OF ABSOLUTE DIFFERENCES FUNCTION

     Assume observations X(1), X(2), ..., X(N).  Assume that you want
the number W which minimizes the sum:

          SUM [i = 1 to N] { |W - X(i)| }

That is, W is the value which minimizes the sum of the absolute
differences between W and the observations.  What is W ?

     The following program is intended to show graphically what W is.
Note that the observations are first put into an array.  Then, in
another data step, the sum of absolute differences is calculated for
each candidate W which is itself an element of the array.  The result
is then graphed using both PROC PLOT (line-printer graph) and
PROC GPLOT (SAS/GRAPH graph).

=======================================================================

FILENAME GRAPH 'gsas.grf' ;
OPTIONS  LINESIZE = 80 MPRINT ;
GOPTIONS
         RESET = GLOBAL
         ROTATE = PORTRAIT
         FTEXT = SWISSB
         DEVICE = PSCOLOR
         GACCESS = SASGASTD
         GSFNAME = GRAPH
         GSFMODE = REPLACE
         GUNIT = PCT BORDER
         CBACK = WHITE
         HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;

%let n = 11 ;

/* Put n uniform random observations into the array x, one record. */
data absmin ;
     array x(500) ;

     do i = 1 to &n ;
        x(i) = ranuni(-1) ;
     end ;

     output ;
run ;

/* For each x(i), compute the sum over j of |x(i) - x(j)| and      */
/* write one record per candidate value.                           */
data xabsums ;
     set absmin ;
     array msum(500) ;
     array x(500) ;

     do i = 1 to &n ;

        msum(i) = 0 ;

        do j = 1 to &n ;
           msum(i) = msum(i) + abs(x(i) - x(j)) ;
        end ;

        xobs   = x(i) ;
        xabsum = msum(i) ;
        output ;

     end ;
run ;

proc sort data = xabsums ; by xobs ;

proc plot ;
     plot xabsum * xobs ;
title1 "The sum of abs diffs function ... n = &n points." ;
run ;

symbol1 i = j v = o c = black w = 3 h = 1.5 ;

proc gplot data = xabsums ;
     plot xabsum * xobs ;
run ;

*===================================================================== ;

        The sum of abs diffs function ... n = 11 points.             1
                                 18:36 Wednesday, September 24, 2003

        Plot of XABSUM*XOBS.  Legend: A = 1 obs, B = 2 obs, etc.

[Line-printer plot: XABSUM on the vertical axis (scale 2.5 to 5.5)
 against XOBS on the horizontal axis (scale 0.0 to 1.0), 11 points.]

/home/gnome/john-c/5421/notes.008  Last update: October 3, 2007.