PubH 7475/8475 Update Page

PubH 7475/8475 Statistical Learning and Data Mining

Spring 2022 Updates

Instructor: Dr. Wei Pan, panxx014@umn.edu

Week 14-15: Project presentations
Week 13 W: Community detection in networks.
notes.
1. Download: Newman MEJ. Detecting community structure in networks.
2. Download: Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E. (2008). Fast unfolding of communities in large networks. arXiv:0803.0476
3. Download, or preprint: Zhao Y, Levina E, Zhu J (2012). Consistency of community detection in networks under degree-corrected stochastic block models. Ann. Statist. Volume 40, Number 4 (2012), 2266-2292.
4. Download: Fortunato S (2010). Community detection in graphs. Physics Reports 486, 75-174.
5. Download: David Meunier, Renaud Lambiotte and Edward T. Bullmore (2010). Modular and hierarchically modular organization of brain networks. Front. Neurosci., 4, 200.
Week 13 M: Semi-supervised learning.
SSL notes.
1. Download: Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton (2020). A Simple Framework for Contrastive Learning of Visual Representations. ICML 2020.
2. Download: Peng Liu, Yusi Fang, Zhao Ren, Lu Tang, George C. Tseng (2021). Outcome-Guided Disease Subtyping for High-Dimensional Omics Data. arXiv:2007.11123
3. Download: Wagstaff et al (2001). Constrained K-means Clustering with Background Knowledge.
4. Download: Liu B, Shen X, Pan W (2013). Semi-supervised spectral clustering with application to detect population stratification. Frontiers in Genetics. 4:215. doi:10.3389/fgene.2013.00215.
5. Download: Wang J, Shen X, Pan W. (2009). On efficient large margin semisupervised learning: method and theory. Journal of Machine Learning Research, 10, 719-742.
6. Download: Wang, J., Shen, X., and Pan, W. (2006). On transductive support vector machines. Contemp. Math., 43, 7-19.
7. Download: Wei Pan, Xiaotong Shen, Aixiang Jiang, and Robert P. Hebbel (2006). Semi-supervised learning via penalized mixture model with application to microarray sample classification. Bioinformatics, 22, 2388-2395.
HWK5 due on April 20.
Week 12: Unsupervised learning (Chapter 14).
1. Download: Tibshirani R, Walther G (2005). Clustering validation by prediction strength. JCGS, 14, 511-528.
2. Download: Liu Y, Hayes DN, Nobel A, Marron JS (2008). Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data. JASA, 103, 1281-1293.
3. Download: McShane LM, Radmacher MD, Freidlin B, Yu R, Li MC, Simon R. (2002). Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics. 18(11):1462-9.
4. Download: Candes EJ, Li X, Ma Y, Wright J (2009). Robust Principal Component Analysis?
5. Download: Shen Y, Wen Z, Zhang Y (2011). Augmented Lagrangian alternating direction method for matrix separation based on low-rank factorization.
6. (Review) Download: Zhang Z, Jordan MI (2008). Multiway spectral clustering: a margin-based perspective. Stat Sci, 23, 383-403.
7. (Review) Download: von Luxburg. A tutorial on spectral clustering.
8. Download: Ng AY, Jordan MI, Weiss Y. On spectral clustering: analysis and an algorithm.
9. Download: Pan W, Shen X, Liu B (2013). Cluster Analysis: Unsupervised Learning via Supervised Learning with a Non-convex Penalty. Journal of Machine Learning Research, 14, 1865-.
10. Download: Pan W, Shen X (2007). Penalized Model-Based Clustering with Application to Variable Selection. Journal of Machine Learning Research, 8, 1145-1164.
11. Download: Li J, Ray S, Lindsay BG (2007). A nonparametric statistical approach to clustering via mode identification. Journal of Machine Learning Research, 8, 1687-1723.
Week 11: Unsupervised learning (Chapter 14).
clustering notes.
1. (Review) Download: Xu R, Wunsch D (2005). Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16, 645-678.
Project proposal due on April 6
HWK4 due on March 30 (which can be extended to April 6)
The mid-term exam is scheduled in class on Wednesday, March 23, 2022. It will be a closed-book, one-hour exam: no books or notes are allowed. It will cover the material up to the end of class on March 21.
Old exam, old sum
Weeks 8-10: FNNs (Chapter 11); CNNs; RNNs; two applications: protein subcellular localization prediction with microscopic cell images (Xiao et al 2019); enhancer-promoter interaction prediction with DNA seq (Zhuang et al 2019); RL.
FNN&CNN notes, R/Keras FNN&CNN, RNN, RL notes.
Mengli's slides and example on CNNs in R. SLDS'18 slides
1. Download: LeCun et al (1998). Gradient-based learning applied to document recognition. Proc of IEEE. (Comment: Section I. p.5-7 most helpful to understand convolutional NNs.)
2. Download: Krizhevsky A, Sutskever I, Hinton G. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS.
3. Download: Zhou J and Troyanskaya OG (2015). Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12, 931-934.
4. Download: Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484-489.
5. Download: Xiao M, Shen X, Pan W. (2019). Application of deep convolutional neural networks in classification of protein subcellular localization with microscopy images. Genetic Epi, 43(3), 330-341.
6. Download: Zhuang Z, Shen X, Pan W. (2019). A simple convolutional neural network for prediction of enhancer-promoter interactions with DNA sequence data. Bioinformatics, 35(17), 2899-2906.
7. Download: Fan J, Ma C, Zhong Y. (2019). A Selective Overview of Deep Learning. arXiv:1904.05526.
8. Download: Bernhard Schölkopf, Francesco Locatello, Stefan Bauer, Nan Rosemary Ke, Nal Kalchbrenner, Anirudh Goyal, Yoshua Bengio. (2021). Towards Causal Representation Learning. arXiv:2102.11107
9. Download: Zech JR, Badgeley MA, Liu M, Costa AB, Titano JJ, Oermann EK (2018) Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study. PLoS Med 15(11): e1002683. doi:10.1371/journal.pmed.1002683.
10. Download: Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, Jonathan K. Su (2019). This Looks Like That: Deep Learning for Interpretable Image Recognition. Advances in Neural Information Processing Systems 32 (NeurIPS 2019).
11. Download: Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. International Journal of Computer Vision (IJCV) 2019.
12. Download here or at the Journal: Stephanie Clark, Rob J Hyndman, Dan Pagendam, Louise M Ryan. (2020). Modern strategies for time series regression. International Stat Rev, 88(S1), S179-S204.
13. Download: Volodymyr Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533.
Week 7: More on RF; Support vector machines (Chapter 12). SVM notes.
1. Download Stefan Wager, Trevor Hastie, Bradley Efron. (2014). Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. JMLR 15(48):1625-1651.
2. Download Lucas Mentch, Giles Hooker. (2016). Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. JMLR 17: 1-41.
3. Download Ishwaran H, Lu M. (2019). Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Stat Med. 38(4):558-582.
4. Download Hugh A. Chipman, Edward I. George, Robert E. McCulloch. (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics 4(1), 266-298.
5. Download Lu M, Sadiq S, Feaster DJ, Ishwaran H. (2018). Estimating Individual Treatment Effect in Observational Data Using Random Forest Methods. J Comput Graph Stat. 27(1), 209-219.
6. Download or here: Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone. (2019). Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition. Statist Sci, 34, 43-68.
7. Download: Wang, L., and Shen, X. (2007). On L1-norm multi-class support vector machines: methodology and theory. JASA, 102, 583-594.
8. Download: Shen, X., Tseng, G.C., Zhang, X., Wong, W.H. (2003). On psi-Learning. JASA, 98, 724-734.
9. Download: Wang J, Shen X, Liu Y. (2008). Probability estimation for large margin classifiers. Biometrika. 95, 149-167.
10. Download: Geman S, Bienenstock E, Doursat R (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4, 1-58.
11. Download: Kondor RI, Lafferty J. Diffusion kernels on graphs and other discrete structures.
12. (Review) Download: Hofmann T, Schölkopf B and Smola AJ (2008). Kernel methods in machine learning. Annals of Statistics, 36, 1171-1220.
13. (Review) Download: Javier M. Moguerza and Alberto Muñoz (2006). Support Vector Machines with Applications. Statistical Science, 21, 322-336.
  Comments and rejoinder, 337-362.
14. (Review) Download: Bing Cheng, D. M. Titterington (1994). Neural Networks: A Review from a Statistical Perspective. Statistical Science, 9, 2-30.
  Comments and rejoinder, 31-54.
15. (Review) Download: B. D. Ripley (1994). Neural Networks and Related Methods for Classification. JRSS-B, 56, 409-456.
HWK3 due on March 2
Weeks 5&6: Trees; Ensemble methods: Bagging (8.7); Bayesian model averaging (BMA) and stacking (8.8), ARM (Yang, 2003); Random forest (Chapter 15); Boosting (Chapter 10): AdaBoost and GBM. notes, notes
Go to an info page for R package gbm.
1. Download Loh W-Y (2014). Fifty years of classification and regression trees (with discussion). International Statistical Review, 82, 329-370.
2. Download Loh W-Y, He X, Man M (2015). A regression tree approach to identifying subgroups with differential treatment effects. Stat Med, 34(11),1818-1833. https://doi.org/10.1002/sim.6454.
3. Download Athey S, Imbens G (2016). Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences, 113(27), 7353-7360. DOI: 10.1073/pnas.1510489113.
4. Download Breiman L (1996). Bagging predictors. Machine Learning, 24, 123-140.
5. Download Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999). Bayesian model averaging: a tutorial (with comments). Stat Sci, 14:382-417.
6. Download: Yang Y (2003). Regression with multiple candidate models: selecting or mixing? Statistica Sinica, vol. 13, 783-809.
7. Download: Yang Y (2001). Adaptive regression by mixing, JASA, vol. 96, 574-588.
8. Download: Shen X, Huang H-C (2006) Optimal model assessment, selection and combination. JASA 101:554-568.
9. Download: Newton MA, Quintana FA, den Boon JA, Sengupta S, Ahlquist P (2007) Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis. Annals of Applied Statistics, 1:85-106.
10. Download: Pan W, Kim J, Zhang Y, Shen X, Wei P (2014) A powerful and adaptive association test for rare variants. Genetics, 197(4):1081-1095.
11. Download: Pan W, Xiao G and Huang X (2006). Input Dependent Weights for Model Combination and Model Selection with Multiple Sources of Data. Statistica Sinica, 16:523-540.
12. Download: Zhang Y, Yang Y (2015). Cross-validation for selecting a model selection procedure. J of Econometrics.
13. Download: Shao J (1997). An asymptotic theory for linear model selection. Stat Sinica 7:221-264.
14. Download Breiman L (2001). Random forests. Machine Learning, 45, 5-32.
15. Download: Saharon Rosset, Ji Zhu, Trevor Hastie (2004). Boosting as a Regularized Path to a Maximum Margin Classifier. JMLR 5:941-973.
16. Download Friedman's MART papers.
17. Download Stefan Wager, Trevor Hastie, Bradley Efron. (2014). Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife. JMLR 15(48):1625-1651.
18. Download Lucas Mentch, Giles Hooker. (2016). Quantifying Uncertainty in Random Forests via Confidence Intervals and Hypothesis Tests. JMLR 17: 1-41.
19. Download Ishwaran H, Lu M. (2019). Standard errors and confidence intervals for variable importance in random forest regression, classification, and survival. Stat Med. 38(4):558-582.
20. Download Hugh A. Chipman, Edward I. George, Robert E. McCulloch. (2010). BART: Bayesian additive regression trees. Annals of Applied Statistics 4(1), 266-298.
21. Download Lu M, Sadiq S, Feaster DJ, Ishwaran H. (2018). Estimating Individual Treatment Effect in Observational Data Using Random Forest Methods. J Comput Graph Stat. 27(1), 209-219.
22. Download or here: Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone. (2019). Automated versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition. Statist Sci, 34, 43-68.
HWK2 due on Feb 16
Week 4: Other penalties (3.8): SCAD (Fan and Li 2001), elastic net (Zou and Hastie 2005), adaptive LASSO (Zou 2006), TLP (Shen et al 2012), group lasso, fused lasso...; SIS; Computational algorithms and statistical inference for high-dimensional data; methods based on derived inputs (3.5-3.6): PCR, PLS.
1. Download: Fan J, Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96 (456), 1348-1360.
2. Download: Zou H (2006), The Adaptive Lasso and Its Oracle Properties. JASA, 101, 1418-1429.
3. Download: Zou H, Hastie T (2005), Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67, 301-320.
4. Download: Austin E, Pan W, Shen X. (2013). Penalized Regression and Risk Prediction in Genome-Wide Association Studies. Stat Anal Data Min. 6(4). doi: 10.1002/sam.11183.
5. Download: Zhu Y, Shen X, Pan W (2013). Simultaneous grouping pursuit and feature selection over an undirected graph. JASA, 108, 713-725.
6. Download: Kim S, Pan W, Shen X (2013). Network-based penalized regression with application to genomic data. Biometrics. 69(3), 582-593.
7. Download: Friedman J, Hastie T, Hoefling H, Tibshirani R (2007). Pathwise Coordinate Optimization. The Annals of Applied Statistics, 1(2), 302-332.
8. Download: Jerome Friedman, Trevor Hastie, Robert Tibshirani (2010). Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1-22.
9. Download: S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122.
10. Download: Shi C, Song R, Chen Z, Li R. (2019). Linear hypothesis testing for high dimensional generalized linear models. Ann Stat, 47(5), 2671-2703.
11. Download: Zhu Y, Shen X, Pan W. (2020). On High-Dimensional Constrained Maximum Likelihood Inference. JASA, 115(529), 217-230.
12. Download: Dezeure R, Bühlmann P, Meier L and Meinshausen N (2015). High-Dimensional Inference: Confidence Intervals, p-Values and R-Software hdi. Stat Sci, 30(4), 533-558.
13. Download: Fan J, Lv J (2008). Sure independence screening for ultrahigh dimensional feature space. JRSS-B 70, 849-911.
14. Download: Chun H and Keles S (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. JRSS-B, 72(1):3-25. (R package "spls")
Week 3: LDA and QDA (4.3); RDA (4.3.1), nearest shrunken centroid (18.2), logistic regression (4.4); penalized logistic regression (18.3.2, 18.4). Linear regression: LS (3.1-3.2); Subset selection (3.3), shrinkage methods: ridge, Lasso (3.4.1-3.4.3); notes
1. Download: Banfield JD, Raftery AE (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821. Note: Section 2 contains a discussion on the eigen-decomposition of a covariance matrix of a Normal distribution.
2. Download: Peter J. Bickel, Elizaveta Levina (2004), Some theory for Fisher's linear discriminant function, "naive Bayes", and some alternatives when there are many more variables than observations. Bernoulli, 10, 989-1010.
3. Download Tibshirani et al. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99, 6567-6572.
4. Download: Xiaohong Huang and Wei Pan (2003), Linear regression and two-class classification with gene expression data. Bioinformatics, 19, 2072-2078.
5. Download: Dudoit S., Fridlyand J, Speed T. P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. JASA, 97, 77-87.
6. Download: Fan J, Fan Y (2008). High-dimensional classification using features annealed independence rules. Ann Statist, 36, 2605-2637.
7. Download: Mai, Q., Zou, H., and Yuan, M. (2012). A Direct Approach to Sparse Discriminant Analysis in Ultra-high Dimensions. Biometrika, 99(1), 29-42.
8. Download: Cai T, Liu W. (2011). A Direct Estimation Approach to Sparse Linear Discriminant Analysis. JASA, 106, 1566-1577.
Week 2: Overview (2.1-2.3); Model selection and assessment (2.9, 7.10); read Curse of dimensionality (2.5). Linear models for classification: intro (4.1), linear regression (4.2); notes
HWK1 due on Feb 2 (by the end of the day on Canvas).
Note: You can have separate pages for the answers and code, or you can mark out (or hand-print) your answers within the mixed code and output. Again, no late HWK is accepted without prior approval or a legitimate reason (e.g. illness).
1. Download WSJ: Big Data Is on the Rise, Bringing Big Questions. (A subscription may be needed.)
2. Download WSJ: Big Data's Big Problem: Little Talent. (A subscription may be needed.)
3. Download McKinsey Global Institute (June 2011). Big data: The next frontier for innovation, competition, and productivity.
4. Download Donoho D. (2015), 50 years of Data Science.
5. Download Breiman L. (2001), Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statist. Sci. 16, iss. 3, 199-231.
6. Download Hand, D.J. (2006), Classifier Technology and the Illusion of Progress (with comments and a rejoinder by the author). Statist. Sci. 21, iss. 1, 1-34.
7. Download S. Guha, R. Hafen, J. Xia, J. Rounds, J. Li, B. Xi, and W. S. Cleveland (2012), Large complex data: divide and recombine (D&R) with RHIPE, Stat 1, 53-67.
8. Download Cleveland W.S. (2001, republished 2014), Data science: An action plan for expanding the technical areas of the field of statistics. Statistical Analysis and Data Mining 7, iss. 6, 414-417.
9. Download B. Yu (2014). Let us own data science. Institute of Mathematical Statistics (IMS) Presidential Address, ASC-IMS Joint Conference, Sydney, July, 2014.
10. Download Yang S, et al. (2015). Accurate estimation of influenza epidemics using Google search data via ARGO. PNAS, 112, 14473-8.
11. Download McKinney SM, et al. (2020). International evaluation of an AI system for breast cancer screening. Nature, 577, 89-94.
12. Download Hollon TC, et al. (2020). Near real-time intraoperative brain tumor diagnosis using stimulated Raman histology and deep neural networks. Nat Med, 26, 52-58.
Week 1 (one class on W): Introduction (Chapter 1); notes