COMPUTATION OF SOME SUMMARY STATISTICS              SPH 5421 notes.008
 
     Most statistical tests and statistical estimates depend on certain
summary statistics which typically reduce a dataset of several hundred
observations to a small number of values which are arithmetic combinations
of the raw data.  The t-statistic, for example, requires computation of
the mean and variance of N numbers. The mean and variance in turn require
computation of the sum of the N numbers and the sum of their squares.
Most SAS procedures, like PROC TTEST, contain algorithms which accumulate
summary values as the dataset is passed through.  At the end of reading
the dataset the procedure then combines the accumulated values into the
desired summary statistics.

     It is not difficult to reconstruct such calculations in many cases.
Suppose, for example, that you want to do your own computation of the
mean and variance of a variable in a given dataset.  Consider the
following SAS code (line numbers have been added to facilitate later
discussion):

  
01  data examp ;
02     retain hcount hsum hsumsq 0 ;
03
04     vexamp = 1 ;
05
06     do i = 1 to 100 ;
07
08        height = 65 + 6 * rannor(-1) ;
09        if height ne . then do ;
10
11           hcount = hcount + 1 ;
12           hsum = hsum + height ;
13           hsumsq = hsumsq + height * height ;
14           output ;
15
16        end ;
17
18     end ;
19
20  run ;
21
22  data sums ;
23     set examp ;
24     by vexamp ;
25
26     if last.vexamp eq 1 and hcount ge 1 then do ;
27
28        avehgt = hsum / hcount ;
29
30        if hcount ge 2 then do ;
31
32           varhgt = (hsumsq - hsum * hsum / hcount) / (hcount - 1) ;
33           sdhgt = sqrt(varhgt) ;
34           sehgt = sdhgt / sqrt(hcount) ;
35           output ;
36
37        end ;
38
39     end ;
40
41  run ;
42
43  proc print data = sums ;
44     var hcount hsum hsumsq avehgt varhgt sdhgt sehgt ;
45  title 'Computation of mean, variance, etc. within data steps' ;
46
47  proc means data = examp  n mean var stddev stderr ;
48     var height ;
49  title 'Computation of mean, variance, etc. by PROC MEANS' ;
50
51  endsas ;


     The above program computes the mean, variance, standard deviation,
and standard error of the mean for a randomly generated set of heights.  
There are several key things to note in the program:

  1.  Line 02:  At the start of each iteration of a data step, computed
      variables are by default reset to missing.  The RETAIN statement
      causes the named variables to keep the value they had at the end
      of the previous iteration.  In line 02, the variables hcount,
      hsum, and hsumsq are declared as RETAIN variables and are
      initialized to 0.  (In this program the DO loop runs within a
      single iteration of the data step, so the initialization to 0 is
      what matters most; without it the sums would start as missing
      and stay missing.)

  2.  Line 04:  The variable VEXAMP is used later in an indirect way
      to tell SAS which is the last valid observation in the dataset.
      See item 4 below for a more complete explanation.

  3.  Lines 11-14: The summary variables for counts, sums, and sums
      of squares of heights are accumulated in lines 11-13.  Note also
      the OUTPUT statement in line 14, which writes one observation,
      including the running totals, for each height.

  4.  Lines 24-26:  At this point I wanted SAS to process only the
      last valid observation in the dataset.  In this case I knew there
      were exactly 100 observations in the dataset examp, but I wanted
      to write the code as if the number of valid observations were
      not known (as if the dataset had been read in from a file with
      unknown characteristics).  SAS has special variables to detect
      the last observation in each group defined by a BY statement.
      The variable last.vexamp is set to 0 for all observations which
      are not the last for a given value of vexamp, and to 1 for any
      observation which is the last.  So the IF statement in line 26
      has the effect of excluding all observations in the dataset
      examp except the last one, which has in it the completed and
      useful summary data on counts, sums, and sums of squares.

      The variable vexamp is defined to take on only one value
      (1).  To some extent I am tricking SAS here into telling me
      which is the last observation in the dataset.  If there is
      a more straightforward way of doing this I would like to know
      about it.

  5.  Lines 28-34:  Here is where the computation of the desired
      statistics from the counts, sums, and sums of squares takes
      place.

  6.  Line 35:  The output statement here is executed only once.
      This means that the dataset 'sums' has in it only one
      observation.  In fact, the structure of the code here is
      such that if hcount < 2, the dataset 'sums' is empty.
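
     Item 4 asks whether there is a more straightforward way to find
the last observation.  One candidate is the END= option of the SET
statement, which sets a temporary 0/1 variable to 1 while the last
observation is being read.  The following is a minimal sketch, not part
of the original program: it assumes the dataset examp created above, and
the names sums2, lastobs, hn, hs, and hss are made up for illustration.
It also uses SAS sum statements (of the form variable + expression),
which retain their accumulators and start them at 0 without an explicit
RETAIN statement.

```sas
data sums2 ;
     /* lastobs is 1 only while the final observation is being read; */
     /* it is a temporary variable and is not written to sums2.      */
     set examp (keep = height) end = lastobs ;

     /* Sum statements: hn, hs, and hss are retained and start at 0. */
     if height ne . then do ;
        hn + 1 ;
        hs + height ;
        hss + height * height ;
     end ;

     /* Same formulas as lines 28-34, computed once at the end.      */
     if lastobs and hn ge 2 then do ;
        avehgt = hs / hn ;
        varhgt = (hss - hs * hs / hn) / (hn - 1) ;
        sdhgt = sqrt(varhgt) ;
        sehgt = sdhgt / sqrt(hn) ;
        output ;
     end ;

     drop height ;
run ;

proc print data = sums2 ;
run ;
```

     If the sketch behaves as intended, sums2 should contain the same
single observation as sums, with no need for the variable vexamp or the
BY statement.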


     Appended is the output from this program.  Note that the data step
computations agree with the PROC MEANS output.

=======================================================================

             Computation of mean, variance, etc. within data steps             1
                                                 19:51 Tuesday, January 25, 2000

   OBS   HCOUNT     HSUM      HSUMSQ     AVEHGT    VARHGT    SDHGT     SEHGT

    1      100    6547.12   431493.36   65.4712   28.7437   5.36132   0.53613

------------------------------------------------------------------------

                Computation of mean, variance, etc. by PROC MEANS               2
                                                 19:51 Tuesday, January 25, 2000

          Analysis Variable : HEIGHT


            N          Mean      Variance       Std Dev     Std Error
          -----------------------------------------------------------
          100    65.4711943    28.7437420     5.3613191     0.5361319
          -----------------------------------------------------------

=======================================================================

PROBLEM 10

1.  Write SAS program code which (1) reads in a datafile, and
    (2) computes the t-statistic for testing whether the mean of a
    specified X is equal to some specified value X0.  Your program
    should behave correctly if missing values are encountered.

2.  Appended is a SAS program which reads in a datafile called
    'lhs.listing'.  Embed your t-statistic code in this program
    and use it to test the null hypothesis that the mean value of
    f10cigs on the file is 30.  Check your answer by an
    appropriate use of PROC MEANS or PROC TTEST.

=======================================================================

data lhslist ;

infile '/home/gnome/john-c/lhs.listing' ;

input ncase agroup   age  f10cigs  drinks
      hgtcm   wgtkg    bmi dthcause av1cigs av5cigs ;

run ;

proc means ;

=======================================================================

ANOTHER EXAMPLE: THE MINIMUM OF THE SUM OF ABSOLUTE DIFFERENCES FUNCTION

     Assume observations X(1), X(2), ..., X(N).  Assume that you
want the number W which minimizes the sum:

     SUM [i = 1 to N] { |W - X(i)| }

     That is, W is the value which minimizes the sum of the
absolute differences between W and the observations.  What is W ?

     The following program is intended to show graphically what
W is.  Note that the observations are first put into an array.
Then in another data step, the sum of absolute differences is
calculated for each candidate W taken from the array (that is, W
is restricted to the observed values).  Then the result is graphed
using both PROC PLOT (line-printer graph) and PROC GPLOT (SAS/GRAPH
graph).

=======================================================================


FILENAME GRAPH 'gsas.grf' ;

OPTIONS  LINESIZE = 80 MPRINT ;

GOPTIONS
         RESET = GLOBAL
         ROTATE = PORTRAIT
         FTEXT = SWISSB
         DEVICE = PSCOLOR
         GACCESS = SASGASTD
         GSFNAME = GRAPH
         GSFMODE = REPLACE
         GUNIT = PCT BORDER
         CBACK = WHITE
         HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;        

%let n = 11 ;

data absmin ;
     array x(500) ;

     do i = 1 to &n ;

        x(i) = ranuni(-1) ;

     end ;

     output ;

run ;

data xabsums ;
     set absmin ;
     array msum(500) ;
     array x(500) ;

     do i = 1 to &n ;

        msum(i) = 0 ;

        do j = 1 to &n ;

           msum(i) = msum(i) + abs(x(i) - x(j)) ;

        end ;

        xobs = x(i) ;
        xabsum = msum(i) ;
        output ;

     end ;

run ;

proc sort data = xabsums ; by xobs ;

proc plot ;
     plot xabsum * xobs ;
title1 "The sum of abs diffs function ... n = &n points." ;
run ;

symbol1 i = j  v = o  c = black w = 3 h = 1.5 ;

proc gplot data = xabsums ;
     plot xabsum * xobs ;
run ;

*===================================================================== ;        
                                                                                
                The sum of abs diffs function ... n = 11 points.               1
                                             18:36 Wednesday, September 24, 2003

            Plot of XABSUM*XOBS.  Legend: A = 1 obs, B = 2 obs, etc.

XABSUM |
       |
   5.5 +
       |    A
       |
       |
       |
       |
       |
   5.0 +
       |
       |
       |
       |
       |
       |
   4.5 +
       |           A
       |
       |
       |
       |                                                                A
       |
   4.0 +
       |
       |
       |
       |
       |
       |
   3.5 +
       |
       |                                                          A
       |                      A
       |
       |
       |                                                        A
   3.0 +
       |                           AA                        A
       |
       |                                                 A
       |                                             A
       |
       |
   2.5 +
       |
       -+-------------+-------------+-------------+-------------+-------------+-
       0.0           0.2           0.4           0.6           0.8           1.0

                                          XOBS
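
     The plot can also be checked numerically.  The following is a
rough sketch, not part of the original program: it assumes the dataset
xabsums created above, and the dataset name ranked is made up for
illustration.  It prints the observation with the smallest sum of
absolute differences and, for comparison, the median of the
observations as computed by PROC MEANS.

```sas
/* Sort a copy by the sum of absolute differences, smallest first. */
proc sort data = xabsums out = ranked ;
     by xabsum ;
run ;

/* The first observation of ranked has the smallest xabsum.        */
proc print data = ranked (obs = 1) ;
     var xobs xabsum ;
title1 'Observation with the smallest sum of absolute differences' ;
run ;

/* Median of the same observations, for comparison.                */
proc means data = xabsums median ;
     var xobs ;
title1 'Median of the observations' ;
run ;
```

     If the two values of xobs agree, that suggests an answer to the
question posed above.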

                                                                                
/home/gnome/john-c/5421/notes.008    Last update: October 3, 2007.