Student Seminar is a bi-weekly seminar series that focuses on student research in the Biostatistics division. Our goal is to enhance students' public speaking abilities in a casual, student-only environment. Both PhD students and MS students are encouraged to present! Presentations can be on a variety of topics, including:
- Your own research! A good opportunity to do a test run of a research presentation.
- Share your skills! In the past, students have presented on useful R packages or their experiences with the job application process.
- Journal-club format: select a paper by someone who is giving a Wednesday Division of Biostatistics seminar (check out the schedule) and introduce us to the topic, followed by a short discussion!
This school year marks the 5th year of student seminars for the Biostatistics division. The organizers are Chuyu Deng and Shannon McKearnan. If you have any feedback or ideas for future talks, please let us know!
The website will be updated as the year progresses.
Please note that the Spring 2020 seminars will now be held via Zoom due to COVID-19, and the meeting link will be emailed out each time.
Click the date/title to reveal each presentation abstract.
One of the most time-consuming tasks statisticians face is formatting tables and figures to meet journal
requirements or appeal to client needs. For this talk, I’ll present two tools to make your job easier,
regardless of career path! SAS ODS is an output delivery system, similar to R Markdown, that allows you
to send your tables and graphics to a document with very little effort. (No more using the snipping tool
to copy/paste your output!) Similarly, the R package officer allows you to add or remove tables and
graphs from Microsoft Office documents. The package differs from R Markdown in that the plots and figures
exported remain editable in the new documents, making it easy for collaborators or clients to adjust the
figures to their liking.
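The core idea behind both tools, pushing a table straight from code into a shareable document, can be sketched in any language. As a rough stdlib-Python illustration only (the summary statistics, column names, and file name below are invented, and this is not the officer or ODS API), one can render a small results table to an HTML file:

```python
import html

# Toy illustration of the "send your table to a document" workflow that
# SAS ODS and the officer package automate; here we emit an HTML file by hand.
rows = [("Age", 52.3, 50.1), ("BMI", 27.8, 26.9)]  # made-up summary stats

def to_html_table(rows, header=("Variable", "Treatment", "Control")):
    """Render a list of row tuples as a simple HTML table string."""
    def cells(r, tag):
        return "".join("<%s>%s</%s>" % (tag, html.escape(str(c)), tag) for c in r)
    body = "".join("<tr>%s</tr>" % cells(r, "td") for r in rows)
    return "<table><tr>%s</tr>%s</table>" % (cells(header, "th"), body)

# Write the table to a document a collaborator can open in a browser or Word.
with open("table1.html", "w") as f:
    f.write(to_html_table(rows))
```

The advantage officer adds over a plain export like this is that the objects it places in a .docx remain native, editable Word tables and figures.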
Slides and related materials are available in the Resources section below.
Functional magnetic resonance imaging (fMRI) data have become increasingly available and are useful for describing functional connectivity (FC), the relatedness of neuronal activity in regions of the brain. This FC of the brain provides insight into certain neurodegenerative diseases and psychiatric disorders, and thus is of clinical importance. To help inform physicians regarding patient diagnoses, unsupervised clustering of subjects based on FC is desired, allowing the data to inform us of groupings of patients based on shared features of connectivity. Even within these groups of patients, heterogeneity across patients in FC is still present. As such, it is important to allow subject-level differences in connectivity, while still pooling information across patients within each group to describe group-level FC. To this end, we propose a random covariance clustering model (RCCM) to concurrently cluster subjects based on their FC networks, estimate the unique FC networks of each subject, and infer shared network features. Although current methods exist for estimating FC or clustering subjects using fMRI data, our novel contribution is to cluster or group subjects based on similar FC of the brain while simultaneously providing group- and subject-level FC network estimates. The competitive performance of RCCM relative to other methods is demonstrated through simulations in various settings, achieving both improved clustering of subjects and estimation of FC networks. Utility of the proposed method is demonstrated with application to a resting-state fMRI data set collected on 43 healthy controls and 61 participants diagnosed with schizophrenia.
Spinal epidural neurostimulation has recently been considered for rehabilitation of persons with partial spinal cord injury. The neurostimulator must be programmed by a neurosurgeon, yet little work has been done to develop rigorous methods for optimally programming the device. We propose an adaptive design to efficiently optimize programming of the neurostimulator based on monthly evaluations of patient-reported preferences. Preferences for the eligible device configurations are estimated after each month through a conditionally auto-regressive model that assumes preference for one configuration is related to preferences for neighboring configurations. Using the adaptively updated preferences, a group of configurations is programmed into the device for the patient to evaluate during the next month. This selection is based on a balance of device exploration and preference maximization. We repeat this process until a specified stopping rule or the trial-end is reached. We show simulation studies to evaluate the overall quality of the adaptive trial for various configuration selection strategies and the effects of stopping the trial early.
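The "balance of device exploration and preference maximization" is the classic explore/exploit trade-off. As a generic illustration only (the talk's actual selection strategy is driven by the conditionally auto-regressive preference model, not this rule, and every number below is invented), a simple epsilon-greedy selection of configurations might look like:

```python
import random

# Illustrative epsilon-greedy rule: each of n_select slots exploits the
# highest estimated preference with probability 1 - epsilon, and explores
# a random untried configuration with probability epsilon.
def select_configurations(pref_estimates, n_select=3, epsilon=0.2, rng=random):
    """Return n_select distinct configuration indices."""
    ranked = sorted(range(len(pref_estimates)),
                    key=lambda i: pref_estimates[i], reverse=True)
    chosen = []
    for _ in range(n_select):
        pool = [i for i in range(len(pref_estimates)) if i not in chosen]
        if rng.random() < epsilon:                              # explore
            chosen.append(rng.choice(pool))
        else:                                                   # exploit
            chosen.append(next(i for i in ranked if i not in chosen))
    return chosen
```

With epsilon = 0 this always programs the top-rated configurations; raising epsilon trades some expected preference for information about under-explored settings.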
Shannon McKearnan, Chuyu Deng, Mengli Xiao
This presentation will be on how we use the Minnesota Supercomputing Institute (MSI) for parallelizing our work and how you can too! We will go through an example workflow for beginner users, and we'll also have additional tips and tricks for more experienced users.
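A common pattern on SLURM-managed clusters like MSI is the job array: each array task reads `SLURM_ARRAY_TASK_ID` from its environment and claims its own slice of the work, submitted with something like `sbatch --array=0-9 job.sh`. Here is a hedged stdlib-Python sketch of that pattern (the simulation grid, setting names, and task count are all invented for illustration):

```python
import os

# Hypothetical grid of 100 simulation settings (names are illustrative).
settings = [{"n": n, "rho": round(0.1 * k, 1)}
            for n in (50, 100, 200, 400, 800)
            for k in range(20)]

def chunk_for_task(task_id, n_tasks=10):
    """Return the subset of settings this array task should run.

    Striding by index (i % n_tasks) spreads slow and fast settings
    roughly evenly across the array tasks.
    """
    return [s for i, s in enumerate(settings) if i % n_tasks == task_id]

# On MSI, SLURM sets SLURM_ARRAY_TASK_ID for each job in the array;
# defaulting to 0 lets the same script also run locally for debugging.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", 0))
my_settings = chunk_for_task(task_id)
```

Each task then writes its results to a task-specific file, and a short follow-up script combines the pieces once the array finishes.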
Adaptive interventions/dynamic treatment regimes provide a method to individualize sequences of treatment. However, critical questions must be answered to develop a quality adaptive intervention (e.g., should a treatment be augmented, switched, or continued if an individual does not initially respond?). Additionally, missing data are a common predicament in many studies, but the nature of missing data in studies with adaptive interventions invites a unique set of challenges. This talk will give a brief introduction to SMARTs (sequential multiple assignment randomized trials) and their role in building adaptive interventions, missing data mechanisms, methods to handle missing data, and the nature of missing data in SMART studies. Given that SMARTs were only introduced into the mainstream less than a decade ago, there is plenty of room for biostatisticians to get to work in this novel area of research!
Thomas Murray, Jeff Boatman, Sarah Boatman, Ziyu Ji
Join us at Student Seminar this week for a very exciting early career panel! We will have representatives with experience working in academia and industry with PhD and MS degrees on our Zoom call to talk about their experiences working and applying for jobs post-graduation and answer your burning questions.
April 30, 2020 - Estimating longitudinal causal effects with participant noncompliance and non-normal confounders
Participant noncompliance, in which participants do not follow their assigned treatment protocol, often obscures the causal relationship between treatment and treatment effect in randomized trials. In the longitudinal setting, the G-computation algorithm can adjust for confounding to estimate causal effects. Typically, G-computation assumes that compliance is known and that the confounders are normally distributed. We aim to develop a G-computation estimator in the setting where both assumptions are violated. In place of compliance, we substitute in probability weights derived from modeling a biomarker associated with compliance. We specify the joint conditional confounder density as a factorization. To generate random samples of the non-normal confounders, we use predictive mean matching, in which the predicted values of the model fit given randomly generated data are matched with similar observed values. In simulation and application, we compare multiple causal estimators already established in the literature with those derived from our method.
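The predictive mean matching (PMM) step described above can be sketched compactly. In this toy stdlib-Python illustration (the fitted mean model, its coefficients, and the skewed "confounder" sample are all invented; in the talk's method the model is fit to the trial data), a randomly generated candidate value is imputed by borrowing an observed value whose predicted mean is close to the candidate's:

```python
import random

# Toy sketch of predictive mean matching (PMM).
random.seed(2020)
observed = [random.expovariate(1.0) for _ in range(200)]  # non-normal confounder

def predicted_mean(x):
    """Stand-in for a fitted regression's predicted value (coefficients made up)."""
    return 0.5 + 0.8 * x

def pmm_draw(x_candidate, donors, k=5, rng=random):
    """Match the candidate's predicted mean to the k donors with the closest
    predicted means, then return one of those observed donor values."""
    target = predicted_mean(x_candidate)
    nearest = sorted(donors, key=lambda d: abs(predicted_mean(d) - target))[:k]
    return rng.choice(nearest)
```

Because every imputed value is an actually observed value, the matched draws respect the non-normal shape of the confounder distribution, which is the motivation for using PMM here.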
Past Skillshare Presentations
- Using the student servers
- Using Rcpp to speed up R
- Using Github with RStudio
- SAS ODS & R Officer
- Parallel Computing with MSI
- Alternate method for parallel computing
External Funding Opportunities
- Doctoral Dissertation Fellowship
- Interdisciplinary Doctoral Fellowship
- MnDRIVE PhD Graduate Assistantship Program
- Interdisciplinary Biostatistics Training Grant in Genetics and Genomics
- ASA Internships - The internship list for summer 2020
- UFL Jobs - Mostly faculty and post docs
- UW Jobs - Mostly faculty and post docs
- ASA JobWeb - The ASA list of bio/stats related jobs
- Purdue Jobs - Jobs from universities and industry
- ISBA Jobs - Bayesian friendly positions
- IMSTAT Jobs - Several faculty listings
- USAJobs - Mostly Census, DHHS, DHS, NSF, and DOE listings
This week's seminar will be held in Mayo 1250.
Shiny is an R package that allows you to make interactive web apps that run R code in the background.
They can be very useful for sharing statistical results and allowing collaborators to explore your results.
In this seminar, I will go over the basics of what a Shiny app is, how to build a Shiny app, and how to share
a Shiny app. I will also share an example of a Shiny app I built for fantasy football projections, along with
other apps that are available on the internet.
This week's seminar will be held in Mayo 3-125. This talk is geared towards Master's students who are eager to begin a consulting experience or are
somewhat unhappy with the apparent lack of statistical consulting opportunities presented to Master's students; it also applies to Ph.D. students.
During my first semester as a Master's student here in the U of M Biostats program, I grew tired of taking only classes. Additionally, my online TA-ing under
phenomenal instructors did little to improve my statistician-to-people skills and left my hands empty of real data. In classes we were gifted "clean"
data pruned for the already planned model being taught in the course. Sadly, there is only one course that teaches statistical consulting and how to properly carry
yourself as a statistical consultant, PUBH 7465, taught by Ann Brearley and Kyle Rudser, usually offered in the second semester of your second year. The problem with this timing is
that you, as a Master's student (speaking from my perspective), have little experience with clients or with discussing models in a non-statistical way with
non-statistically inclined people. Additionally, the research experiences offered to Master's students do not satisfy the "research experience" requirement that many job and internship
applications request. In this talk, I will discuss various approaches to, problems with, and remedies for finding a first statistical consulting opportunity. If you
have any suggestions for the presentation, or preset questions that I can answer during it, email me at firstname.lastname@example.org. I will gladly try to
answer all questions.
This week's seminar will be held in Mayo 3-125. In the post-GWAS era, colocalization testing has been playing an increasingly important role in inferring causal
genetic variants and genes from GWAS trait-associated loci. However, colocalization testing is challenging. We first discuss some severe limitations of the existing methods,
thus motivating our development of a general and powerful method. We use extensive simulations to demonstrate the advantages of our method over existing methods. We apply
our method and existing methods (when possible) to colocalization analyses of multiple correlated GWAS traits, and of a GWAS trait and gene expression.
This week's seminar will be held in Mayo 3-125. We aim to concurrently conduct sparse precision matrix estimation and clustering of subjects using a random
covariance method with the EM algorithm. The proposed method is computationally more efficient in the high-dimensional setting than some existing methods due to its use of
the KL divergence as a dissimilarity measure between matrices rather than an L1 penalty on corresponding matrix entries. The algorithm yields sparse precision matrix
estimates, estimated cluster assignments, and uncertainty measures for these assignments.
Mayo 3-125. Biomedical devices that stimulate the spinal cord have shown markedly beneficial results for paraplegic patients compared to standard physical
therapies. These devices have many settings, such as placement along the lower spine, pulse width, and frequency. However, there is an apparent scarcity
in the literature on implementing a Phase-I Adaptive clinical trial to assess the grid of setting values for these Spinal Cord Epidural Stimulation (SCES) devices in the
case of one patient. In this presentation, we disclose a possible design for a Phase-I Adaptive clinical trial that effectively provides the patient with monthly sets of
settings to explore and considers an early stopping rule for the trial.
April 1, 2019 - Semiparametric modeling of time-varying activation and connectivity in task-based fMRI data
Jun Young Park
Mayo 3-125. In functional magnetic resonance imaging (fMRI), there is rising evidence that the temporal change in the synchronization of brain activity, known as dynamic
functional connectivity (dFC) or time-varying connectivity, provides additional information on brain networks not captured by measures of connectivity that are static over
time. While there have been many developments for statistical models for dFC when the study participants are at rest, there remains a gap in the literature on how to
simultaneously model both dFC and time-varying activation when the study participants are undergoing an experimental task designed to probe a cognitive process of
interest. We propose a method to estimate the dFC between two regions of interest (ROI) in task-based fMRI where the activation effects are also allowed to vary over time.
Our method uses penalized splines to model both time-varying activation effects and time-varying connectivity, and uses the bootstrap for statistical inference. We validate
our approach using simulations and show that ignoring time-varying activation effects would lead to poor estimation of dFC. Our proposed model, called TVAAC (time-varying
activation and connectivity), can estimate both static and time-varying activation and functional connectivity. We give an empirical illustration of both time-varying
activation and connectivity by using our proposed method to analyze two subjects in an event-related fMRI learning experiment.
April 15, 2019 - Iterated Multi-Source Exchangeability Models for Individualized Inference with an Application to Mobile Sensor Data
Mayo 3-125. Researchers are increasingly interested in using sensor technology to collect accurate activity information and make individualized inference about treatments,
exposures, and policies. How to optimally combine population data with data from an individual remains an open question. Multi-source exchangeability models (MEMs) are a
Bayesian approach for increasing precision by combining potentially heterogeneous supplemental data sources into analysis of a primary source. MEMs are a potentially
powerful tool for individualized inference but can integrate only a few sources; their model space grows exponentially, making them intractable for high-dimensional
applications. We propose iterated MEMs (iMEMs), which identify a subset of the most exchangeable sources prior to fitting a MEM. iMEM complexity scales linearly with
the number of sources, and iMEMs greatly increase precision while maintaining desirable asymptotic and small sample properties. We apply iMEMs to individual-level behavior
and emotion data from a smartphone app and show that they achieve individualized inference with up to 99% efficiency gain relative to standard analyses that do not borrow information.
Mayo 3-125. Sometimes R code just isn't fast enough. You've tried everything in R to increase performance...but it's still too slow!
The package Rcpp allows you to rewrite key functions in C++ in order to eliminate bottlenecks in your R code (loops, recursive functions, etc.).
This is particularly useful if you want to write your own R package. In fact, as of May 2017, 1,026 packages on
CRAN and a further 91 on
BioConductor deploy Rcpp to accelerate computations and to connect to other C++ projects.
I'll provide a gentle introduction to incorporating C++ into your own R code and into your R packages. I'll also point out great resources that go into much greater detail
on all aspects of Rcpp. Note: you do not need any background in C++ to get started using Rcpp! All coding levels are welcome!
Jun Young Park
Mayo 3-125. Advances in molecular "omics" technologies have motivated new methodology for the integration of multiple sources of high-content biomedical data. However,
most statistical methods for integrating multiple data matrices only consider data shared vertically (one cohort on multiple platforms) or horizontally (different cohorts
on a single platform). This is limiting for data that take the form of bidimensionally linked matrices (e.g., multiple cohorts measured on multiple platforms), which are
increasingly common in large-scale biomedical studies. In this paper, we propose BIDIFAC (Bidimensional Integrative Factorization) for integrative dimension reduction and
signal approximation of bidimensionally linked data matrices. Our method factorizes the data into (i) globally shared, (ii) row-shared, (iii) column-shared, and (iv)
single-matrix structural components, facilitating the investigation of shared and unique patterns of variability. For estimation we use a penalized objective function that
extends the nuclear norm penalization for a single matrix. As an alternative to the complicated rank selection problem, we use results from random matrix theory to choose
tuning parameters. We apply our method to integrate two genomics platforms (mRNA and miRNA expression) across two sample cohorts (tumor samples and normal tissue samples)
using the breast cancer data from TCGA.
Background/Aims: Participant noncompliance, in which participants do not follow their assigned treatment protocol, has long complicated the interpretation and conduct of
randomized clinical trials. No gold standard exists for detection of noncompliance, but participants' biomarkers can suggest exposure to non-study treatments. However,
existing methods can only detect noncompliance based on a single biomarker measurement. We propose a novel method that uses longitudinal biomarker data to model compliance
across time when compliance is unobserved. Conditional on longitudinal biomarker data, our method can estimate: 1) the probability of compliance at a single time point of
the trial; 2) the probability of compliance at all time points; and 3) the prediction probability of compliance at a future time point.
Methods: We model the joint distribution of the biomarker as a mixture density across time points, in which joint compliance probabilities serve as the weights and
joint biomarker densities conditional on compliance serve as the components. To derive the mixture density, we assume that both compliance and the biomarker were generated from
corresponding mixed effects models. Modeling the biomarker as a mixture density allows us to calculate compliance probabilities that condition on the longitudinal biomarker
data. To evaluate the accuracy of the compliance probabilities, we conduct a Monte Carlo simulation study across three different effects of compliance on the biomarker. We
compare probability estimators 1) and 2) to those that ignore the longitudinal correlation in the data according to AUC. As 3) does not have a naive comparator, we plot
its calibration lines.
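A toy single-time-point version of this mixture construction can make the idea concrete. In the sketch below (component means, standard deviation, and prior weight are all invented, and the talk's actual method is longitudinal with mixed-effects components), the posterior probability of compliance given a biomarker value is the compliant component's weighted density over the mixture density, i.e. Bayes' rule:

```python
import math

def normal_pdf(y, mean, sd):
    """Density of a normal distribution at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Toy two-component mixture: compliant biomarkers center at 0, noncompliant
# at 2 (all values hypothetical).
def prob_compliant(y, prior=0.7, mean_c=0.0, mean_nc=2.0, sd=1.0):
    """P(compliant | y) via Bayes' rule with two normal components."""
    num = prior * normal_pdf(y, mean_c, sd)
    den = num + (1 - prior) * normal_pdf(y, mean_nc, sd)
    return num / den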
Results: Across all three compliance effects on the biomarker, conditioning on the longitudinal biomarker data uniformly raised AUC. For estimating the probability of
compliance at a single time point, conditioning on participants' full biomarker history increased AUC by 4-5 percentage points relative to only conditioning on their most
recent biomarker measurement. For full compliance, adjusting for the longitudinal data correlation boosted AUC by 8-10 percentage points relative to ignoring the
correlation. The calibration lines for the prediction of compliance closely approximated perfect calibration.
Conclusion: Compared to existing methods that can only use a single biomarker measurement, our method can use all biomarker measurements to more accurately identify
noncompliant participants. Our method can also use participants' biomarker history to predict compliance at a future time point.