PROC NPAR1WAY:  Non-Parametric Statistics                              n54703.002

Nonparametric statistics are used for testing statistical hypotheses in
situations where the true distributions of the variables are not known, or
where they are known but are not close to the distributions assumed in
carrying out parametric tests.  Here by 'parametric tests' we mean tests
based on specific distributions, such as z-tests (normal distribution),
chi-squared tests (chi-square distribution), and many others.

The following program shows the use of nonparametric statistics on a dataset
showing weight gains in animals for 5 different dose levels of a substance
called 'gossypol' (a constituent of cottonseed):

----------------------------------------------------------------------------------
FILENAME GRAPH 'gsas.grf' ;
OPTIONS  LINESIZE = 80 PAGESIZE = 30 ;
GOPTIONS
         RESET = GLOBAL
         ROTATE = PORTRAIT
         FTEXT = SWISSB
         DEVICE = PSCOLOR
         GACCESS = SASGASTD
         GSFNAME = GRAPH
         GSFMODE = REPLACE
         GUNIT = PCT BORDER
         CBACK = WHITE
         HTITLE = 2 HTEXT = 1 ;

*===================================================================== ;

footnote "~john-c/5421/gossypol.sas &sysdate &systime" ;

data gossypol ;
     input dose n ;
     do i = 1 to n ;
        input wgtgain @@ ;
        output;
     end ;

datalines ;
0 16
228 229 218 216 224 208 235 229 233 219 224 220 232 200 208 232
.04 11
186 229 220 208 228 198 222 273 216 198 213
.07 12
179 193 183 180 143 204 114 188 178 134 208 196
.10 17
130 87 135 116 118 165 151 59 126 64 78 94 150 160 122 110 178
.13 11
154 130 130 118 118 104 112 134 98 100 104
;

symbol1 v = 'o' w = 2 h = 5 c = black ;
symbol2 v = 'o' w = 2 h = 5 c = black ;
symbol3 v = 'o' w = 2 h = 5 c = black ;
symbol4 v = 'o' w = 2 h = 5 c = black ;
symbol5 v = 'o' w = 2 h = 5 c = black ;

proc plot data = gossypol ;
     plot wgtgain * dose ;
title1 'Weight gains vs Dose Levels of Gossypol' ;
run ;

proc npar1way anova wilcoxon median edf data = gossypol ;
     class dose ;
     var wgtgain ;
title1 'PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.' ;

endsas;
----------------------------------------------------------------------------------

                    Weight gains vs Dose Levels of Gossypol                    1
                                               18:59 Wednesday, January 21, 2004

          Plot of WGTGAIN*DOSE.  Legend: A = 1 obs, B = 2 obs, etc.

WGTGAIN |
    300 +
        |                    A
        |
        |D
        |I                   F
    200 +C                   C              D
        |                    A              E              A
        |                                                  D              A
        |                                   B              B              C
        |                                   A              E              C
    100 +                                                  A              D
        |                                                  B
        |                                                  B
        |
        |
      0 +
        -+---------+---------+---------+---------+---------+---------+---------+
       0.00      0.02      0.04      0.06      0.08      0.10      0.12      0.14

                                          DOSE

~john-c/5421/gossypol.sas 21JAN04 18:59


     PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.     2
                                               18:59 Wednesday, January 21, 2004

                      N P A R 1 W A Y  P R O C E D U R E

                  Analysis of Variance for Variable WGTGAIN
                         Classified by Variable DOSE

DOSE      N          Mean            Among MS       Within MS
                                   35020.7465      627.451597
0        16    222.187500
0.04     11    217.363636             F Value        Prob > F
0.07     12    175.000000              55.814          0.0001
0.1      17    120.176471
0.13     11    118.363636

                      Average Scores Were Used for Ties

~john-c/5421/gossypol.sas 21JAN04 18:59


     PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.     3
                                               18:59 Wednesday, January 21, 2004

                      N P A R 1 W A Y  P R O C E D U R E

               Wilcoxon Scores (Rank Sums) for Variable WGTGAIN
                         Classified by Variable DOSE

                   Sum of      Expected       Std Dev          Mean
DOSE      N        Scores      Under H0      Under H0         Score

0        16    890.500000         544.0    67.9789655    55.6562500
0.04     11    555.000000         374.0    59.0635883    50.4545455
0.07     12    395.500000         408.0    61.1366221    32.9583333
0.1      17    275.500000         578.0    69.3807412    16.2058824
0.13     11    161.500000         374.0    59.0635883    14.6818182

                      Average Scores Were Used for Ties

               Kruskal-Wallis Test (Chi-Square Approximation)
            CHISQ = 52.666     DF = 4     Prob > CHISQ = 0.0001

~john-c/5421/gossypol.sas 21JAN04 18:59


     PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.
                                                                               4
                                               18:59 Wednesday, January 21, 2004

                      N P A R 1 W A Y  P R O C E D U R E

       Median Scores (Number of Points Above Median) for Variable WGTGAIN
                         Classified by Variable DOSE

                   Sum of      Expected       Std Dev          Mean
DOSE      N        Scores      Under H0      Under H0         Score

0        16          16.0    7.88059701    1.75790231    1.00000000
0.04     11          11.0    5.41791045    1.52735508    1.00000000
0.07     12           6.0    5.91044776    1.58096271    0.50000000
0.1      17           0.0    8.37313433    1.79415153    0.00000000
0.13     11           0.0    5.41791045    1.52735508    0.00000000

                      Average Scores Were Used for Ties

               Median 1-Way Analysis (Chi-Square Approximation)
            CHISQ = 54.176     DF = 4     Prob > CHISQ = 0.0001

~john-c/5421/gossypol.sas 21JAN04 18:59


     PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.     5
                                               18:59 Wednesday, January 21, 2004

                      N P A R 1 W A Y  P R O C E D U R E

                 Kolmogorov-Smirnov Test for Variable WGTGAIN
                         Classified by Variable DOSE

                                       Deviation
                            EDF        from Mean
DOSE      N          at Maximum       at Maximum

0        16          0.00000000      -1.91044776
0.04     11          0.00000000      -1.58405960
0.07     12          0.33333333      -0.49979576
0.1      17          1.00000000       2.15386115
0.13     11          1.00000000       1.73256519
       ----         -----------
         67          0.47761194

              Maximum Deviation Occurred at Observation 36
              Value of WGTGAIN at Maximum 178.000000

              Kolmogorov-Smirnov Statistic (Asymptotic)
              KS = 0.457928          KSa = 3.74830

~john-c/5421/gossypol.sas 21JAN04 18:59


     PROC NPAR1WAY applied to data on weight gain vs 5 levels of gossypol.
                                                                               6
                                               18:59 Wednesday, January 21, 2004

                      N P A R 1 W A Y  P R O C E D U R E

                  Cramer-von Mises Test for Variable WGTGAIN
                         Classified by Variable DOSE

                     Summed Deviation
DOSE      N                 from Mean

0        16                2.16521023
0.04     11                0.91827966
0.07     12                0.34822684
0.1      17                1.49754164
0.13     11                1.33574457

              Cramer-von Mises Statistic (Asymptotic)
              CM = 0.093508          CMa = 6.26500

~john-c/5421/gossypol.sas 21JAN04 18:59
----------------------------------------------------------------------------------

DISCUSSION OF EXAMPLE

Note the input statements:

     input dose n ;
     do i = 1 to n ;
        input wgtgain @@ ;
        output;
     end ;

The data file in this case is structured as follows: on the first line, the
dose level and the number of animals given that dose are specified.  For
example, the first line is:  0 16 , meaning dose level 0 and n = 16 animals.
Then on the second line, the weight gains for each of the 16 animals are
given.  The third line says:  .04 11 , meaning the dose was .04 and n = 11
animals got that dose.  And so on.

Note that the second 'input' statement is:

     input wgtgain @@ ;

The two @ signs tell SAS to hold the current data line, so that the next
'input' statement that is executed continues reading from that line instead
of moving to a new one.  The 'do' loop therefore reads the n weight gains one
at a time from the same line.

Note the 'output' statement: this means that after each weight gain is read
in, an observation is written to the dataset.  The first 5 observations of
the dataset will look like the following (the loop counter i is also stored
in the dataset; it is omitted here):

     Obs     dose      n     wgtgain
    -----   ------    ---   ---------
      1        0      16       228
      2        0      16       229
      3        0      16       218
      4        0      16       216
      5        0      16       224

After the data were read in, PROC PLOT was called to plot the weight gains
versus the dose levels.  Many of the points overlap.  PROC GPLOT could also
be used here, and its better resolution would prevent overprinting of the
points.

The output from PROC PLOT shows rather convincing evidence of differences in
weight gain between the dose groups.  There appears to be a downward trend in
weight gain as the dose is increased.
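As a minimal sketch (not part of the original program), the following steps
list the first few observations and redraw the plot with PROC GPLOT.  It
assumes that the data step above has already created the dataset gossypol,
that SAS/GRAPH is available, and that the SYMBOL1 and GOPTIONS statements
from the program are still in effect; the title text is an illustrative
choice.

```sas
* Sketch only: assumes the dataset gossypol from the data step above ;

* List the first 5 observations to check how the data were read ;
proc print data = gossypol (obs = 5) ;
     var dose n wgtgain ;
run ;

* Higher-resolution version of the PROC PLOT scatterplot ;
proc gplot data = gossypol ;
     plot wgtgain * dose ;
title1 'Weight gains vs Dose Levels of Gossypol (GPLOT)' ;
run ;
quit ;
```

With the GOPTIONS shown in the program, the graph would be written to the
file referenced by GRAPH rather than displayed on screen.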

PROC NPAR1WAY is invoked next:

     proc npar1way anova wilcoxon median edf data = gossypol ;

The options used here are as follows:

   1.  anova:  Performs a standard one-way analysis of variance to test
       differences in means between the groups.  This test is appropriate if
       the measurement error of the outcome variable is known to have a
       normal distribution.

   2.  wilcoxon:  This is a true nonparametric test, known also as the
       rank-sum test.  It is appropriate for comparing data from continuous
       distributions in which the groups differ in that the distributions
       are *shifted* away from each other.  When more than two groups are
       being compared, this is also known as the Kruskal-Wallis test.

   3.  median:  This test counts, within each group, the observations that
       lie above the overall median of the whole dataset.  It is a powerful
       test when the distributions are symmetric and have heavy tails.

   4.  edf:  Here 'edf' stands for 'empirical distribution function'.  This
       option requests the Kolmogorov-Smirnov test, which compares the
       cumulative distributions between the groups.  The Cramer-von Mises
       test is also performed.

With this dataset, the first 3 tests indicate significant differences between
the groups (p = 0.0001 in each case), as one might expect from examining the
PROC PLOT output.

The groups are sufficiently small here that one cannot be very certain about
the distribution of the outcome variable (weight gain) within the groups, so
nonparametric tests are appropriate.  But which one should you use?  Why not
just use standard one-way analysis of variance?  Which one should you pick if
the results are different?

It must be admitted that there is no clear guidance on these questions.  In
general, for data of the kind shown here, which almost certainly have a
continuous underlying distribution and for which it is plausible that the
distributions of weight gain within the groups are shifts of one another, I
would recommend the Wilcoxon test.  The median and edf tests are likely to be
less powerful.
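If only the Wilcoxon analysis is wanted, the call can be reduced to that one
option.  The sketch below is an illustration under stated assumptions, not
part of the original program: it assumes the dataset gossypol already exists,
and the choice of doses 0 and .04 for the two-group comparison is arbitrary.
The first step gives the Kruskal-Wallis test across all 5 groups; the second
restricts the data to two groups, which gives the two-sample rank-sum test.

```sas
* Sketch only: assumes the dataset gossypol from the data step above ;

* Kruskal-Wallis test across all 5 dose groups ;
proc npar1way wilcoxon data = gossypol ;
     class dose ;
     var wgtgain ;
title1 'Wilcoxon scores: all dose groups' ;
run ;

* Two-group rank-sum test: dose 0 versus dose .04 (illustrative choice) ;
proc npar1way wilcoxon data = gossypol ;
     where dose in (0, .04) ;
     class dose ;
     var wgtgain ;
title1 'Wilcoxon scores: dose 0 versus dose .04' ;
run ;
```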

ANOVA is not, strictly speaking, a nonparametric test, but it is usually
treated as such.  This is because for moderate to large sample sizes, means
have approximately normal distributions.  Here, however, the sample sizes are
small.  A few extreme observations can distort the findings of the ANOVA
test.  If there appear to be outliers within the groups, one would prefer the
Wilcoxon test (which is affected only by the ranks of the outcome variable,
not by its absolute magnitudes).

Note that the plot indicates a *trend* in the data.  None of the tests used
here provides a test for trend.  A parametric procedure, such as PROC REG,
certainly provides a test for trend, but it is likely to be most powerful
when the trend is linear, and, like ANOVA, it assumes that measurement errors
in the outcome variable are normally and independently distributed.  There
are nonparametric tests for monotone trends (for example, the
Jonckheere-Terpstra test), but they are not available in PROC NPAR1WAY.

Other test options available in PROC NPAR1WAY include:

   5.  VW:  Van der Waerden (normal) scores.  Powerful for detecting
       differences in location when the underlying distributions are normal.

   6.  Klotz:  Based on the squares of the Van der Waerden scores.  A test
       for differences in scale.

   7.  Savage:  Powerful for comparing scale differences in exponential
       distributions or location shifts in extreme-value distributions.

   8.  Siegel-Tukey:  A test for differences in scale.  Properties not
       discussed in SAS/STAT Ver 8.

   9.  Ansari-Bradley:  A test for differences in scale.  Properties not
       discussed in SAS/STAT Ver 8.

  10.  Mood:  Related to Wilcoxon.  A test for differences in scale.
       Properties not discussed in SAS/STAT Ver 8.

----------------------------------------------------------------------------------

PROBLEMS

Problem 1.

Use the data file on crime rates in Chapter 4.  Print histograms of the crime
rates for all states combined and for Southern states and non-Southern states
separately.  Perform a t-test and nonparametric tests of the hypothesis that
the crime rates in the Southern states are the same as the crime rates in the
non-Southern states  [Note: this is not actually a proper use of hypothesis
testing, since this is not a random sample of states.
Perform the test as if the assignment to "Southern" and "non-Southern" had
been randomly made.]  Describe your findings.

Problem 2.

Given the following dataset,

  OBS   GROUP          X           Y

    1     1     -0.58149      0.3381
    2     1      0.11909      0.0142
    3     1      0.40898      0.1673
    4     1      1.58229      2.5036
    5     1      0.25558      0.0653
    6     1     -0.50366      0.2537
    7     1      2.56224      6.5651
    8     1      0.01418      0.0002
    9     1      0.89403      0.7993
   10     1     -0.69543      0.4836
   11     1     -0.99360      0.9872
   12     1     -0.14279      0.0204
   13     1     -0.26365      0.0695
   14     1     -1.51597      2.2982
   15     1     -0.39561      0.1565
   16     1      0.62815      0.3946
   17     1      1.20440      1.4506
   18     1     -0.08493      0.0072
   19     1     -0.29970      0.0898
   20     1     -0.07620      0.0058
   21     1     -1.52330      2.3204
   22     1      3.00385      9.0231
   23     1     -0.89299      0.7974
   24     1     -1.43763      2.0668
   25     1     -0.27793      0.0772
   26     1      0.88995      0.7920
   27     1      0.96424      0.9298
   28     1     -0.80702      0.6513
   29     1     -0.33802      0.1143
   30     1     -0.73330      0.5377
   31     1     -0.28173      0.0794
   32     1     -3.60218     12.9757
   33     1     -0.50744      0.2575
   34     1      0.88039      0.7751
   35     1      1.10071      1.2116
   36     1      0.22413      0.0502
   37     1      0.16220      0.0263
   38     1      0.40509      0.1641
   39     1     -0.58761      0.3453
   40     1     -0.94528      0.8935
   41     1      1.73639      3.0151
   42     1      0.44392      0.1971
   43     1      1.80667      3.2641
   44     1      0.02912      0.0008
   45     1     -1.80752      3.2671
   46     1      0.39963      0.1597
   47     1     -1.06043      1.1245
   48     1      0.05343      0.0029
   49     1      0.21036      0.0443
   50     1      0.21532      0.0464
   51     2     -1.77325      3.1444
   52     2     -0.34354      0.1180
   53     2      0.67068      0.4498
   54     2     -1.58006      2.4966
   55     2      2.76764      7.6599
   56     2     -1.99040      3.9617
   57     2     -1.66359      2.7675
   58     2      0.92017      0.8467
   59     2     -0.25762      0.0664
   60     2      1.66384      2.7684
   61     2      1.58245      2.5041
   62     2      1.56187      2.4394
   63     2      0.37986      0.1443
   64     2      0.31203      0.0974
   65     2      2.14525      4.6021
   66     2      0.22862      0.0523
   67     2     -3.81253     14.5354
   68     2     -1.69397      2.8695
   69     2     -0.35756      0.1278
   70     2      2.33153      5.4360
   71     2      2.67642      7.1632
   72     2     -1.62174      2.6300
   73     2      0.29400      0.0864
   74     2      3.27067     10.6973
   75     2      1.37223      1.8830
   76     2     -1.04064      1.0829
   77     2     -0.89953      0.8092
   78     2     -3.75235     14.0801
   79     2     -4.03463     16.2782
   80     2     -1.76834      3.1270
   81     2     -1.15576      1.3358
   82     2      2.38133      5.6707
   83     2     -0.34702      0.1204
   84     2     -0.11424      0.0131
   85     2      0.96115      0.9238
   86     2      1.84926      3.4197
   87     2      1.06257      1.1291
   88     2     -1.69664      2.8786
   89     2      1.74271      3.0370
   90     2      0.29919      0.0895
   91     2      1.75470      3.0790
   92     2     -5.02719     25.2726
   93     2      0.72882      0.5312
   94     2     -2.10171      4.4172
   95     2     -0.83845      0.7030
   96     2     -0.29680      0.0881
   97     2      0.63033      0.3973
   98     2      0.71548      0.5119
   99     2     -0.77999      0.6084
  100     2     -3.00337      9.0202

(1)  Create histograms for variables X and Y, separately for Group 1 and
     Group 2.

(2)  Create scatterplots of Y versus X, separately for Group 1 and Group 2.

(3)  Use the Kolmogorov-Smirnov test to test whether the distribution of Y
     for Group 1 is the same as that for Group 2.

n54703.002  Last update: January 25, 2004.