PROC NPAR1WAY: Non-Parametric Statistics n54703.002
Nonparametric statistics are used for testing statistical hypotheses
in situations where the true distributions of the variables are not known
or where they are known but those distributions are not close to those
which are assumed in carrying out parametric tests. Here by 'parametric
tests' we mean tests based on specific distributions, such as z-tests
(normal distribution), chi-squared tests (chi-square distribution),
and many others.
The following program shows the use of nonparametric statistics on
a dataset showing weight gains in animals for 5 different dose levels
of a substance called 'gossypol' (a constituent of cottonseed):
----------------------------------------------------------------------------------
FILENAME GRAPH 'gsas.grf' ;
OPTIONS LINESIZE = 80 PAGESIZE = 30 ;
GOPTIONS
RESET = GLOBAL
ROTATE = PORTRAIT
FTEXT = SWISSB
DEVICE = PSCOLOR
GACCESS = SASGASTD
GSFNAME = GRAPH
GSFMODE = REPLACE
GUNIT = PCT BORDER
CBACK = WHITE
HTITLE = 2 HTEXT = 1 ;
*===================================================================== ;
footnote "~john-c/5421/gossypol.sas &sysdate &systime" ;
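data gossypol ;
   input dose n ;           /* header line: dose level and group size */
   do i = 1 to n ;
      input wgtgain @@ ;    /* hold the data line so each pass reads the next value */
      output ;              /* write one observation per animal */
   end ;
datalines ;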
0 16
228 229 218 216 224 208 235 229 233 219 224 220 232 200 208 232
.04 11
186 229 220 208 228 198 222 273 216 198 213
.07 12
179 193 183 180 143 204 114 188 178 134 208 196
.10 17
130 87 135 116 118 165 151 59 126 64 78 94 150 160 122 110 178
.13 11
154 130 130 118 118 104 112 134 98 100 104
;
symbol1 v = 'o' w = 2 h = 5 c = black ;
symbol2 v = 'o' w = 2 h = 5 c = black ;
symbol3 v = 'o' w = 2 h = 5 c = black ;
symbol4 v = 'o' w = 2 h = 5 c = black ;
symbol5 v = 'o' w = 2 h = 5 c = black ;
proc plot data = gossypol ;
   plot wgtgain * dose ;
   title1 'Weight gains vs Dose Levels of Gossypol' ;
run ;
proc npar1way anova wilcoxon median edf data = gossypol ;
   class dose ;
   var wgtgain ;
   title1 'PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.' ;
run ;
endsas ;
----------------------------------------------------------------------------------
Weight gains vs Dose Levels of Gossypol 1
18:59 Wednesday, January 21, 2004
Plot of WGTGAIN*DOSE. Legend: A = 1 obs, B = 2 obs, etc.
WGTGAIN |
    300 +
        |                    A
        |
        |D
        |I                   F
    200 +C                   C              D
        |                    A              E              A
        |                                                  D              A
        |                                   B              B              C
        |                                   A              E              C
    100 +                                                  A              D
        |                                                  B
        |                                                  B
        |
        |
      0 +
        -+---------+---------+---------+---------+---------+---------+---------+
        0.00      0.02      0.04      0.06      0.08      0.10      0.12      0.14
                                          DOSE
~john-c/5421/gossypol.sas 21JAN04 18:59
PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol. 2
18:59 Wednesday, January 21, 2004
N P A R 1 W A Y P R O C E D U R E
Analysis of Variance for Variable WGTGAIN
Classified by Variable DOSE
DOSE     N          Mean        Among MS     Within MS
                                35020.7465   627.451597
0       16    222.187500
0.04    11    217.363636        F Value      Prob > F
0.07    12    175.000000         55.814        0.0001
0.1     17    120.176471
0.13    11    118.363636
Average Scores Were Used for Ties
~john-c/5421/gossypol.sas 21JAN04 18:59
PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol. 3
18:59 Wednesday, January 21, 2004
N P A R 1 W A Y P R O C E D U R E
Wilcoxon Scores (Rank Sums) for Variable WGTGAIN
Classified by Variable DOSE
                  Sum of      Expected       Std Dev          Mean
DOSE     N        Scores      Under H0      Under H0         Score
0       16    890.500000         544.0    67.9789655    55.6562500
0.04    11    555.000000         374.0    59.0635883    50.4545455
0.07    12    395.500000         408.0    61.1366221    32.9583333
0.1     17    275.500000         578.0    69.3807412    16.2058824
0.13    11    161.500000         374.0    59.0635883    14.6818182
Average Scores Were Used for Ties
Kruskal-Wallis Test (Chi-Square Approximation)
CHISQ = 52.666 DF = 4 Prob > CHISQ = 0.0001
~john-c/5421/gossypol.sas 21JAN04 18:59
PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol. 4
18:59 Wednesday, January 21, 2004
N P A R 1 W A Y P R O C E D U R E
Median Scores (Number of Points Above Median)
for Variable WGTGAIN
Classified by Variable DOSE
                  Sum of      Expected       Std Dev          Mean
DOSE     N        Scores      Under H0      Under H0         Score
0       16          16.0    7.88059701    1.75790231    1.00000000
0.04    11          11.0    5.41791045    1.52735508    1.00000000
0.07    12           6.0    5.91044776    1.58096271    0.50000000
0.1     17           0.0    8.37313433    1.79415153    0.00000000
0.13    11           0.0    5.41791045    1.52735508    0.00000000
Average Scores Were Used for Ties
Median 1-Way Analysis (Chi-Square Approximation)
CHISQ = 54.176 DF = 4 Prob > CHISQ = 0.0001
~john-c/5421/gossypol.sas 21JAN04 18:59
PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol. 5
18:59 Wednesday, January 21, 2004
N P A R 1 W A Y P R O C E D U R E
Kolmogorov-Smirnov Test for Variable WGTGAIN
Classified by Variable DOSE
                             Deviation
                     EDF     from Mean
DOSE     N    at Maximum    at Maximum
0       16    0.00000000   -1.91044776
0.04    11    0.00000000   -1.58405960
0.07    12    0.33333333   -0.49979576
0.1     17    1.00000000    2.15386115
0.13    11    1.00000000    1.73256519
      ----   -----------
        67    0.47761194
Maximum Deviation Occurred at Observation 36
Value of WGTGAIN at Maximum 178.000000
Kolmogorov-Smirnov Statistic (Asymptotic)
KS = 0.457928 KSa = 3.74830
~john-c/5421/gossypol.sas 21JAN04 18:59
PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol. 6
18:59 Wednesday, January 21, 2004
N P A R 1 W A Y P R O C E D U R E
Cramer-von Mises Test for Variable WGTGAIN
Classified by Variable DOSE
                  Summed
               Deviation
DOSE     N     from Mean
0       16    2.16521023
0.04    11    0.91827966
0.07    12    0.34822684
0.1     17    1.49754164
0.13    11    1.33574457
Cramer-von Mises Statistic (Asymptotic)
CM = 0.093508 CMa = 6.26500
~john-c/5421/gossypol.sas 21JAN04 18:59
----------------------------------------------------------------------------------
DISCUSSION OF EXAMPLE
Note the input statements:
input dose n ;
do i = 1 to n ;
input wgtgain @@ ;
output;
end ;
The data file in this case is structured as follows: on the first line,
the dose level and the number of animals given that dose are specified:
for example, the first line is: 0 16 , meaning dose level 0 and n = 16
animals. Then on the second line, the weight gains for each of the 16 animals
are given. The third line says: .04 11 , meaning the dose was .04 and
n = 11 animals got that dose. And so on.
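Note that the second 'input' statement is : input wgtgain @@ ;
The two trailing @ signs tell SAS to hold the current data line instead of
moving to a new line each time the statement executes, so successive passes
through the loop read successive values from the same line. The DO loop
stops after n values have been read.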
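As a minimal sketch of this reading pattern, the step below applies the same
two INPUT statements to a small made-up dataset. The names demo, group and
value are illustrative only and are not part of the gossypol program.

```sas
data demo ;
   input group n ;          /* header line: group label and count */
   do i = 1 to n ;
      input value @@ ;      /* read the next value from the held data line */
      output ;              /* one observation per value */
   end ;
   drop i ;                 /* the loop counter is not needed afterwards */
datalines ;
1 3
10 20 30
2 2
40 50
;
proc print data = demo ;
run ;
```

The resulting dataset should contain five observations: three for group 1
and two for group 2.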
Note the 'output' statement: this means that after each weight gain is
read in, an observation is written to the dataset. The first 5 observations
of the dataset will look like the following (the loop counter i is also
stored in the dataset, but is not shown):
Obs dose n wgtgain
----- ------ --- ---------
1 0 16 228
2 0 16 229
3 0 16 218
4 0 16 216
5 0 16 224
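A listing like the one above could be produced with PROC PRINT. This is a
sketch that assumes the gossypol dataset from the program is still available
in the session (that is, it is run before the ENDSAS statement); the VAR
statement leaves out the loop counter i.

```sas
proc print data = gossypol (obs = 5) ;   /* first 5 observations only */
   var dose n wgtgain ;                  /* omit the loop counter i */
run ;
```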
After the data were read in, PROC PLOT was called to plot the weight gains
versus the dose levels. Many of the points overlap; the plot marks these
with letters (A = 1 obs, B = 2 obs, etc.). PROC GPLOT could also be used
here, and its better resolution would reduce overprinting of the points.
(The GOPTIONS and SYMBOL statements in the program affect only SAS/GRAPH
procedures such as PROC GPLOT; PROC PLOT ignores them.) The output from
PROC PLOT shows rather convincing evidence of differences in weight gain
between the dose groups. There appears to be a downward trend in weight
gain as the dose is increased.
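A possible PROC GPLOT version of the same plot is sketched here. It assumes
that SAS/GRAPH is installed and that the GOPTIONS statement from the program
is in effect; the symbol settings are one reasonable choice, not the only one.

```sas
symbol1 v = circle h = 2 c = black ;   /* open circles, height in GUNIT units */
proc gplot data = gossypol ;
   plot wgtgain * dose ;
   title1 'Weight gains vs Dose Levels of Gossypol' ;
run ;
quit ;
```

Animals with identical dose and weight gain will still be drawn on top of
one another.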
PROC NPAR1WAY is invoked next:
proc npar1way anova wilcoxon median edf data = gossypol ;
The options used here are described as follows:
1. anova: Performs a standard one-way analysis of variance to test
differences in means between the various groups. This test is
appropriate if the measurement error of the outcome variable
is known to have a normal distribution with the same variance
in each group.
2. Wilcoxon: This is a true nonparametric test, known also as the rank-sum
test. It is appropriate for comparing data from continuous
distributions in which the various groups differ in that
the distributions are *shifted* away from each other. When
more than two groups are being compared, this is also known
as the Kruskal-Wallis test.
3. median: This test scores each observation as 1 if it lies above the
overall median of the whole dataset and as 0 otherwise, and then
compares these scores between the groups. It is a powerful test
when the distributions are symmetric and have heavy tails.
4. edf: Here 'edf' stands for 'empirical distribution function'. This
option requests the Kolmogorov-Smirnov test, which compares
the cumulative distributions between the groups.
The Cramer-von Mises test is also performed.
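The Kruskal-Wallis statistic in the output can be checked by hand from the
rank sums on the Wilcoxon page. The sketch below uses the standard formula
H = 12/(N(N+1)) * sum(R**2/n) - 3(N+1) without the correction for ties,
so it should come out slightly below the 52.666 that SAS reports (roughly
52.6). The dataset and variable names (ranks, kw, ranksum and so on) are
made up for this illustration.

```sas
data ranks ;
   input dose n ranksum ;   /* N and Sum of Scores from the Wilcoxon page */
datalines ;
0    16 890.5
0.04 11 555.0
0.07 12 395.5
0.10 17 275.5
0.13 11 161.5
;

data kw ;
   set ranks end = last ;
   total_n + n ;                   /* sum statements keep running totals */
   sumsq + ranksum**2 / n ;
   groups + 1 ;
   if last then do ;
      h  = 12 / (total_n * (total_n + 1)) * sumsq - 3 * (total_n + 1) ;
      df = groups - 1 ;
      p  = 1 - probchi(h, df) ;    /* chi-square approximation */
      output ;
   end ;
   keep total_n h df p ;
run ;

proc print data = kw ;
run ;
```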
With this dataset, the first 3 tests indicate significant differences
between the groups (each reports a p-value of 0.0001), as one might expect
from examining the PROC PLOT output. The edf pages of the output report the
Kolmogorov-Smirnov and Cramer-von Mises statistics for the five groups, but
no p-values. The groups are sufficiently small here that one cannot be very
certain about the distribution of the outcome variable (weight gain) within
the groups, so nonparametric tests are appropriate.
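The Kolmogorov-Smirnov page of the output above shows the statistics KS and
KSa but no probability. When the class variable has exactly two levels,
PROC NPAR1WAY reports the two-sample form of the test together with an
asymptotic p-value. A sketch, restricting the analysis to the two lowest
dose groups with a WHERE statement (the choice of groups is arbitrary):

```sas
proc npar1way wilcoxon edf data = gossypol ;
   where dose < 0.05 ;      /* keeps only the 0 and .04 dose groups */
   class dose ;
   var wgtgain ;
   title1 'Two-group comparison: dose 0 vs dose .04' ;
run ;
```

Running several such pairwise comparisons raises the usual multiple-testing
concerns, so this is better suited to a planned comparison than to a search
through all pairs.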
But which one should you use? Why not just use standard one-way
analysis of variance? Which one should you pick if the results are
different?
It must be admitted that there is no clear guidance on these questions.
In general, for data of the kind shown here, which almost certainly have
a continuous underlying distribution and for which it is plausible that the
distributions of weight gain within the groups are shifts of one another,
I would recommend the Wilcoxon test. The median and edf tests are likely
to be less powerful.
The ANOVA test is not, strictly speaking, a nonparametric test, but it is
often treated as one. This is because for moderate to large sample
sizes, group means have approximately normal distributions even when the
individual observations do not. Here, however, the
sample sizes are small. A few extreme observations can distort the findings
of the ANOVA test. If there appear to be outliers within the groups, one
would prefer the Wilcoxon test (which is affected only by the ranks of the
outcome variable, not by its absolute magnitudes).
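One simple way to look for outliers within the groups is to compare the
mean, median and range by dose. The sketch below assumes a SAS release in
which PROC MEANS supports the MEDIAN statistic. In these data the value 273
in the .04 dose group stands out: every other animal in that group gained
229 or less.

```sas
proc means data = gossypol n mean median std min max maxdec = 1 ;
   class dose ;             /* one row of statistics per dose group */
   var wgtgain ;
run ;
```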
Note that the plot indicates a *trend* in the data. None of the
tests used here provides a test for trend. A parametric procedure,
such as PROC REG, certainly provides a test for trend, but it
is likely to be most powerful when the trend is linear, and, like
ANOVA, the underlying assumption is that measurement errors in the
outcome variable are normally and independently distributed. There
are nonparametric tests for monotone trends, but they are not
available in PROC NPAR1WAY.
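For illustration, the steps below sketch three ways of looking at trend
outside PROC NPAR1WAY: a linear regression with PROC REG, the Spearman rank
correlation from PROC CORR, and the Jonckheere-Terpstra test requested with
the JT option of PROC FREQ. Whether the JT option is available depends on
the SAS release, so treat that step as an assumption to check against your
installation.

```sas
proc reg data = gossypol ;
   model wgtgain = dose ;               /* parametric test for linear trend */
run ;
quit ;

proc corr data = gossypol spearman ;    /* rank correlation of dose and gain */
   var dose wgtgain ;
run ;

proc freq data = gossypol ;
   tables dose * wgtgain / jt noprint ; /* Jonckheere-Terpstra; table suppressed */
run ;
```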
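Other test options available in PROC NPAR1WAY include:
5. VW: Van der Waerden test: uses normal-quantile scores; powerful for
detecting location shifts when the underlying distributions are normal.
(It is not a test of normality.)
6. Klotz: Uses the squares of the Van der Waerden scores; a test for
differences in scale (spread).
7. Savage: Powerful for comparing scale differences in exponential
distributions or location shifts in extreme-value distributions
8. Siegel-Tukey: A test for differences in scale. Properties not discussed in SAS/STAT Ver 8.
9. Ansari-Bradley: A test for differences in scale. Properties not discussed in SAS/STAT Ver 8.
10. Mood: Related to Wilcoxon (the scores are squared deviations of the ranks
from the average rank); a test for differences in scale. Properties not
discussed in SAS/STAT Ver 8.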
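These tests are requested in the same way as the four used in the example,
by naming the option on the PROC statement. A sketch using the gossypol data
and three of the keywords from the list (VW, SAVAGE, KLOTZ):

```sas
proc npar1way vw savage klotz data = gossypol ;
   class dose ;
   var wgtgain ;
   title1 'Additional score tests for weight gain by dose' ;
run ;
```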
----------------------------------------------------------------------------------
PROBLEMS
Problem 1.
Use the data file on crime rates in Chapter 4. Print histograms of the crime
rates for all states combined and for Southern states and non-Southern states
separately.
Perform a t-test and nonparametric tests of the hypothesis that the crime
rates in the Southern states are the same as the crime rates in the non-Southern
states [Note: this is not actually a proper use of hypothesis testing,
since this is not a random sample of states. Perform the test as if
the assignment to "Southern" and "non-Southern" had been randomly made.]
Describe your findings.
Problem 2.
Given the following dataset,
OBS GROUP X Y
1 1 -0.58149 0.3381
2 1 0.11909 0.0142
3 1 0.40898 0.1673
4 1 1.58229 2.5036
5 1 0.25558 0.0653
6 1 -0.50366 0.2537
7 1 2.56224 6.5651
8 1 0.01418 0.0002
9 1 0.89403 0.7993
10 1 -0.69543 0.4836
11 1 -0.99360 0.9872
12 1 -0.14279 0.0204
13 1 -0.26365 0.0695
14 1 -1.51597 2.2982
15 1 -0.39561 0.1565
16 1 0.62815 0.3946
17 1 1.20440 1.4506
18 1 -0.08493 0.0072
19 1 -0.29970 0.0898
20 1 -0.07620 0.0058
21 1 -1.52330 2.3204
22 1 3.00385 9.0231
23 1 -0.89299 0.7974
24 1 -1.43763 2.0668
25 1 -0.27793 0.0772
26 1 0.88995 0.7920
27 1 0.96424 0.9298
28 1 -0.80702 0.6513
29 1 -0.33802 0.1143
30 1 -0.73330 0.5377
31 1 -0.28173 0.0794
32 1 -3.60218 12.9757
33 1 -0.50744 0.2575
34 1 0.88039 0.7751
35 1 1.10071 1.2116
36 1 0.22413 0.0502
37 1 0.16220 0.0263
38 1 0.40509 0.1641
39 1 -0.58761 0.3453
40 1 -0.94528 0.8935
41 1 1.73639 3.0151
42 1 0.44392 0.1971
43 1 1.80667 3.2641
44 1 0.02912 0.0008
45 1 -1.80752 3.2671
46 1 0.39963 0.1597
47 1 -1.06043 1.1245
48 1 0.05343 0.0029
49 1 0.21036 0.0443
50 1 0.21532 0.0464
51 2 -1.77325 3.1444
52 2 -0.34354 0.1180
53 2 0.67068 0.4498
54 2 -1.58006 2.4966
55 2 2.76764 7.6599
56 2 -1.99040 3.9617
57 2 -1.66359 2.7675
58 2 0.92017 0.8467
59 2 -0.25762 0.0664
60 2 1.66384 2.7684
61 2 1.58245 2.5041
62 2 1.56187 2.4394
63 2 0.37986 0.1443
64 2 0.31203 0.0974
65 2 2.14525 4.6021
66 2 0.22862 0.0523
67 2 -3.81253 14.5354
68 2 -1.69397 2.8695
69 2 -0.35756 0.1278
70 2 2.33153 5.4360
71 2 2.67642 7.1632
72 2 -1.62174 2.6300
73 2 0.29400 0.0864
74 2 3.27067 10.6973
75 2 1.37223 1.8830
76 2 -1.04064 1.0829
77 2 -0.89953 0.8092
78 2 -3.75235 14.0801
79 2 -4.03463 16.2782
80 2 -1.76834 3.1270
81 2 -1.15576 1.3358
82 2 2.38133 5.6707
83 2 -0.34702 0.1204
84 2 -0.11424 0.0131
85 2 0.96115 0.9238
86 2 1.84926 3.4197
87 2 1.06257 1.1291
88 2 -1.69664 2.8786
89 2 1.74271 3.0370
90 2 0.29919 0.0895
91 2 1.75470 3.0790
92 2 -5.02719 25.2726
93 2 0.72882 0.5312
94 2 -2.10171 4.4172
95 2 -0.83845 0.7030
96 2 -0.29680 0.0881
97 2 0.63033 0.3973
98 2 0.71548 0.5119
99 2 -0.77999 0.6084
100 2 -3.00337 9.0202
(1) Create histograms for variables X and Y, separately for
Group 1 and Group 2
(2) Create scatterplots of Y versus X, separately for Group 1
and Group 2.
(3) Use the Kolmogorov-Smirnov test to test whether the
distribution of Y for Group 1 is the same as that for Group 2.
n54703.002 Last update: January 25, 2004.