School of Public Health

Department of Biostatistics


PubH 7407: Analysis of Categorical Data - Spring 2009


Course Objectives

This course provides a graduate-level introduction to models and methods for analyzing categorical data. The best-known categorical data model is logistic regression. We will explore this model in detail, along with log-linear models, ordinal regression models, and approaches to handling correlated data with categorical outcomes. Course syllabus (PDF).


Material covered

  • Tues., Jan. 20: Sections 1.1, 1.2, 1.3. Class notes.
  • Thur., Jan. 22: Sections 1.4, 1.5. Class notes.
  • Tues., Jan. 27: Sections 2.1, 2.2. Class notes.
  • Thur., Jan. 29: Sections 2.3, 2.4. Class notes.
  • Tues., Feb. 3: Sections 3.1, 3.2, 3.3. Class notes.
  • Thur., Feb. 5: Sections 3.4, 3.5. Class notes. Some sample code.
  • Tues., Feb. 10: Sections 4.1, 4.2, 4.3. Class notes.
  • Thur., Feb. 12: Sections 4.3, 4.4, 4.5, 4.6, 4.7. Class notes.
  • Tues., Feb. 17: Chapter 4, continued.
  • Thur., Feb. 19: Sections 5.1, 5.2, 5.3. Class notes.
  • Tues., Feb. 24: Sections 5.4, 5.5. Class notes. Some material on complete and quasicomplete separation.
  • Thur., Feb. 26: Chapter 5 continued.
  • Tues., Mar. 3: Chapter 5 continued.
  • Thur., Mar. 5: Sections 6.1, 6.2, 6.3. Course notes.
  • Tues., Mar. 10: Review. Sample problems: most homework problems, including the odd-numbered ones. Midterm from 2007. Midterm from 2008 with answers. Some review notes.
  • Thur., Mar. 12: Exam I.
  • Tues., Mar. 24: Sections 6.3, 6.4, 6.5, 6.6. Course notes. Exam I answer key.
  • Thur., Mar. 26: Chapter 6 continued.
  • Tues., Mar. 31: Sections 7.1, 7.2, 7.4.3. Course notes.
  • Thur., Apr. 2: Chapter 7, continued.
  • Tues., Apr. 7: Sections 10.1, 10.2. Course notes.
  • Thur., Apr. 9: Sections 11.3, 11.4. Course notes.
  • Tues., Apr. 14: Sections 11.5, 12.1, 12.2. Course notes.
  • Thur., Apr. 16: Sections 12.2, 12.3. Course notes.
  • Tues., Apr. 21: Sections 12.3, 12.6, 12.4. Course notes. Section 12.5. Course notes.
  • Thur., Apr. 23: SAS's GLIMMIX. Course notes. Sections 10.4.1, 10.5.4, and 4.8. Course notes.
  • Tues., Apr. 28: Continued.
  • Thur., Apr. 30: A bit on sensitivity, specificity, and ROC curves. Notes and plots in Word.
  • Tues., May 5: Review Chapters 6, 7, 10, 11, 12. Review notes. 2007 Exam II. 2008 Exam II.
  • Thur., May 7: Exam II. Answer key.


Homework assignments

  • Homework 1, due Thursday Feb. 5, Chapter 1: 1, 2*, 3, 4*, 6*, 7, 8*, 10*, 12ab* (hint: a formula from probability helps), 17ab, 30* (part b is extra credit), 31. Chapter 2: 1, 2*, 3, 4*, 5, 7, 8*, 9, 10*. * = hand in.
  • Homework 2, due Thursday Feb. 12, Chapter 2: 12*, 15, 18abc*, 19, 20*, 21, 29, 30*. Chapter 3: 1, 2*, 3, 4*, 5, 9. * = hand in. There is an extra credit problem in the notes involving local odds ratios.
  • Homework 3, due Thursday Feb. 19, Chapter 3: 10*, 11, 12* (also obtain estimate and 95% CI for polychoric correlation), 31, 32ab*. Chapter 4: 1, 2*, 3.
  • Homework 4, due Thursday Feb. 26, Chapter 4: 5, 6abc*, 7, 8*, 11, 12*, 13, 14*, 15, 17, 18*, 19, 21, 22 (this is why it's important to group!), 28 (extra credit), 30*, 32*.
  • Homework 5, due Friday Mar. 6 by noon, Chapter 5: 1abcefgh, 2* (also report H-L test), 4* (also report H-L test for linear and quadratic models), 6*, 8*, 12* (check for interaction as well -- here, the interaction model is also the saturated model), 15, 16a*, 17* (check for interaction & report H-L too), 19, 22*, 26*, 28*. Good problems, but beyond our scope: 33, 34, 37, 42. Sample code for space shuttle data.
  • Homework 6, due Thursday Apr. 2. Vasoconstriction data; heart data. For the problems involving data from the Los Angeles Heart Study and problem 5.26, consider only logistic regression, i.e., not probit or cloglog.
  • Homework 7, due Thursday, April 9. Chapter 7: 1abc, 2*, 3, 4*, 7, 9*, 10*, 22*, 29.
  • Homework 8, due Thursday, April 16. Chapter 10: 1*, 4*. For 4, perform a conditional analysis in proc logistic using the strata and exact commands and interpret the results. Do not do parts (a) through (e). Here's the data. Chapter 11: 2* (hint: you will define a CLASS variable called "substance" with three levels), 3, 6, 7b, 8*, 9, 10*. Substance use data set; variables are subject, use (yes/no=1/0), type (1,2,3=alcohol, cigarettes, marijuana), race (1/0=white/other), and gender (1/0=female/male). Sample SAS code to get the data for problem 10 in a good form for GENMOD.
  • Homework 9, due April 28. Paper and data for multi-site clinical trial. Teratology data.
  • Extra credit problem and data. Optional, but due April 28 if you do it.
  • Homework solutions from two years ago: Solutions 1, Solutions 2, Solutions 3, Solutions 4, Solutions 5. More solutions: Solutions 6; Solutions 7 with code and output; Solutions 8.