Spring 2006
May 5, 2006: Feng Tai
Incorporating Biological Knowledge into Tumor Classification by Introducing Multiple Shrinkage Parameters in Penalized Partial Least Squares
In the context of sample (e.g. tumor) classification with microarray gene expression data, many methods have been proposed. However, almost all of these methods ignore existing biological knowledge and treat all genes equally a priori. Yet some genes are known to have biological functions or to be involved in pathways related to the disease, and thus these genes are likely to be more relevant. Here we propose incorporating such biological knowledge into building a classifier to improve the interpretability and prediction performance of the resulting model.
Apr 28, 2006: Peng Wei
Incorporating gene functions into regression analysis of DNA-protein binding data and gene expression data to construct transcriptional networks
Useful information on transcriptional networks has been extracted by regression analyses of gene expression data and DNA-protein binding data. However, a potential limitation of these approaches is their assumption of a common and constant activity level of a transcription factor (TF) on all the genes in any given experimental condition; for example, any TF is assumed to be either an activator or a repressor, but not both, while it is known that some TFs can be dual regulators. Rather than assuming a common linear regression model for all the genes, we propose using separate regression models for various gene groups; the genes can be grouped based on their functions or some clustering results. Furthermore, to take advantage of the hierarchical structure of many existing gene function annotation systems, such as Gene Ontology (GO), we propose a shrinkage method that borrows information from relevant gene groups. Applications to a yeast dataset and simulations lend support to our proposed method. In particular, we find that the shrinkage method consistently works well under various scenarios. We recommend the shrinkage method as a useful alternative to the existing method.
This was jointly done with Dr. Wei Pan.
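The idea of fitting a separate regression per gene group and then borrowing information across groups can be sketched in a few lines. This is only an illustration under stated assumptions, not the method presented in the talk: the no-intercept model, the function names, the fixed `weight`, and shrinking toward a single pooled slope (rather than toward related groups in the GO hierarchy) are all hypothetical simplifications.

```python
def slope_through_origin(x, y):
    """Least-squares slope for the no-intercept model y = b * x."""
    sxx = sum(xi * xi for xi in x)
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


def shrunken_group_slopes(groups, weight):
    """Shrink each group's slope toward the slope pooled over all genes.

    groups: dict mapping a group name to (binding, expression) lists.
    weight: hypothetical tuning constant in [0, 1]; 1 keeps the separate
            group fits, 0 reproduces the common (pooled) model.
    """
    all_x = [xi for x, _ in groups.values() for xi in x]
    all_y = [yi for _, y in groups.values() for yi in y]
    pooled = slope_through_origin(all_x, all_y)
    return {
        name: weight * slope_through_origin(x, y) + (1.0 - weight) * pooled
        for name, (x, y) in groups.items()
    }


# Toy data (made up): the same TF activates one group and represses the other.
toy_groups = {
    "group_a": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    "group_b": ([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0]),
}
print(shrunken_group_slopes(toy_groups, weight=0.5))
```

With `weight=0` both groups receive the same pooled slope, which hides the opposite signs; intermediate weights keep the sign difference while pulling the group estimates toward each other.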
Apr 21, 2006: Baolin Wu
Margin-based classifiers with L1 regularization
Margin has been recognized as an important concept in building classifiers for high-dimensional data, where it is common to observe n<<p. The commonly used SVM can be motivated by a margin-maximizing hyperplane in high-dimensional space, with an L2 penalty used to regularize the model fitting. A margin-based classifier with an L2 penalty does not have built-in feature selection, and in principle all features are used in the model fitting. In contrast, L1 regularization has an automatic feature selection (sparsity) property, which is especially useful for high-dimensional data modeling. Recently, regression using a relaxed L1 penalty has been studied and shown to have some advantages over the LASSO in high-dimensional regression. In this talk we will discuss a class of margin-based classifiers using L1 and relaxed L1 penalties. The main focus is on developing efficient algorithms to compute the entire regularization solution paths to facilitate the model fitting. Applications to public microarray data and simulation studies will be used to illustrate the methods.
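The contrast between the two penalties is easiest to see in a much simpler setting than the classifiers discussed in the talk: least-squares regression with an orthonormal design. There, minimizing 0.5*||y - Xb||^2 + lam*||b||_1 soft-thresholds each unpenalized coefficient, while minimizing 0.5*||y - Xb||^2 + 0.5*lam*||b||^2 only rescales it. The sketch below assumes that setting; the function names and the numbers are illustrative, and this is not the solution-path algorithm from the talk.

```python
def soft_threshold(coefs, lam):
    """L1 solution under an orthonormal design: shrink by lam, zero out small ones."""
    out = []
    for c in coefs:
        magnitude = max(abs(c) - lam, 0.0)
        if magnitude == 0.0:
            out.append(0.0)  # coefficient dropped from the model
        else:
            out.append(magnitude if c > 0 else -magnitude)
    return out


def ridge_shrink(coefs, lam):
    """L2 solution under an orthonormal design: rescale, never exactly zero."""
    return [c / (1.0 + lam) for c in coefs]


unpenalized = [3.0, -0.5, 0.2, -2.0]  # made-up least-squares coefficients
print(soft_threshold(unpenalized, lam=1.0))
print(ridge_shrink(unpenalized, lam=1.0))
```

The L1 result contains exact zeros, so the corresponding features are removed; the L2 result keeps every feature with a smaller coefficient.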
Apr 14, 2006: Saonli Basu
Linkage Analysis of Complex Traits: GAW 15 Problem 2
Rheumatoid arthritis (RA) is an autoimmune disease. Family and twin studies suggest that the genetic component of RA susceptibility is around 60%. Linkage and association studies provide considerable evidence that multiple interacting loci influence the risk for RA, but so far the involvement of the major histocompatibility complex (MHC) has been the only consistent finding in genetic studies. However, the MHC is estimated to contribute only 30-40% of the total genetic susceptibility component for the disease.
Problem 2 of Genetic Analysis Workshop (GAW) 15 gives us an excellent opportunity to try to understand the etiology of RA. By analyzing the GAW 15 data, we can familiarize ourselves with different methodologies and currently available software for genome-wide linkage and/or association approaches. This dataset also gives us an opportunity to develop methods that model multiple interacting loci together in a genome scan for RA. I will talk about the findings for RA from different linkage/association studies. I will briefly describe the GAW 15 Problem 2 dataset and discuss different techniques we can use to analyze the data.
Apr 07, 2006: Na Li
An Expression of Linkage: GAW 15 Problem 1
The purpose of this talk is to discuss coordinating an approach to a problem proposed for the current Genetic Analysis Workshop. Please forward this to any parties that may be interested. This is the first time that a gene expression data set has been part of the workshop.
The Genetic Analysis Workshop (GAW) is a collaborative effort among genetic epidemiologists to evaluate and compare statistical genetic methods. For each GAW, topics are chosen that are relevant to current analytical problems in genetic epidemiology, and sets of real or computer-simulated data are distributed to investigators worldwide. Results of analyses are discussed and compared at meetings held in even-numbered years.
Feb 17, 2006: Yan Zheng
Normalization of microarrays in transcription inhibition experiments
Almost all of the existing methods for normalization assume that not too many of the genes differ in expression levels across arrays. Hence when the level of expression for many genes is not roughly constant across arrays, these standard methods are inappropriate. Here we develop a model for normalization in the context of an experiment that attempts to measure mRNA half-lives by stopping transcription and then measuring gene expression at certain later times. This model does not assume that most genes are constant across arrays, but rather assumes that some genes have long half-lives. By supposing there are genes with long half-lives relative to the duration of the experiment, the model allows estimation of normalizing terms. Certain weaknesses of the basic model are noted, and a more sophisticated model is developed that addresses these shortcomings.
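A minimal sketch of the basic idea follows, assuming the simplest possible version: a known set of genes whose half-lives are so long that their true log expression is treated as constant over the experiment, so any change they show across arrays is attributed to an array effect. The function name, the data layout, and the choice of the first array as the reference are assumptions made for illustration; the abstract does not specify the model in this detail.

```python
def normalize_by_stable_genes(log_expr, stable_genes):
    """Remove per-array offsets estimated from long half-life genes.

    log_expr: dict mapping gene name to a list of log intensities,
              one per array; array 0 (time zero) is the reference.
    stable_genes: names of genes assumed to be essentially constant.
    """
    n_arrays = len(next(iter(log_expr.values())))
    offsets = []
    for j in range(n_arrays):
        # Average apparent change of the stable genes relative to array 0.
        diffs = [log_expr[g][j] - log_expr[g][0] for g in stable_genes]
        offsets.append(sum(diffs) / len(diffs))
    return {
        gene: [value - offsets[j] for j, value in enumerate(values)]
        for gene, values in log_expr.items()
    }


# Made-up log intensities for three arrays (time 0 and two later times).
toy = {
    "stable_1": [10.0, 10.5, 9.0],
    "stable_2": [8.0, 8.5, 7.0],
    "decaying": [6.0, 6.0, 4.0],
}
print(normalize_by_stable_genes(toy, ["stable_1", "stable_2"]))
```

After normalization the stable genes are flat and the remaining gene shows a steady decline that the raw values obscured. Standard normalization, which assumes most genes are unchanged, would instead tend to remove the decay itself.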
Feb 10, 2006: Wei Pan
Incorporating Gene Functional Annotations in Detecting Differential Gene Expression
The importance of incorporating existing biological knowledge, such as gene functional annotations in Gene Ontology (GO), in analyzing high-throughput genomic and proteomic data is being increasingly recognized. In the context of detecting differential gene expression, however, the current practice of using gene annotations is limited primarily to validations. Here we take a direct approach of incorporating gene annotations into mixture models for analysis. First, in contrast to a standard mixture model assuming that each gene of the genome has the same distribution, we study stratified mixture models allowing genes with different annotations to have different distributions, such as prior probabilities. Second, rather than treating parameters in stratified mixture models independently, we propose a hierarchical model to take advantage of the hierarchical structure of most gene annotation systems, such as GO. We consider a simplified implementation for the purpose of proof-of-concept. An application to a mouse microarray dataset and a simulation study demonstrate the improvement of the two new approaches over the standard mixture model.
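The effect of letting the prior probability depend on a gene's annotation can be sketched with a two-component mixture. The code below is only an illustration: the unit-variance normal densities, the alternative mean `alt_mean`, and the prior values are assumptions chosen for the example, not the models or estimates from the talk.

```python
import math


def normal_pdf(z, mean, sd=1.0):
    """Density of a normal distribution at z."""
    return math.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def posterior_de(z, prior_de, alt_mean=2.0):
    """Posterior probability that a gene with statistic z is differentially expressed.

    prior_de: prior probability of differential expression; in a stratified
              mixture model it is specific to the gene's annotation group.
    alt_mean: hypothetical mean of the statistic for differentially
              expressed genes (null genes are assumed to have mean 0).
    """
    f0 = normal_pdf(z, 0.0)        # null component
    f1 = normal_pdf(z, alt_mean)   # differentially expressed component
    return prior_de * f1 / (prior_de * f1 + (1.0 - prior_de) * f0)


# Same statistic, two annotation groups with different (made-up) priors.
z = 1.5
print(posterior_de(z, prior_de=0.05))
print(posterior_de(z, prior_de=0.30))
```

A standard mixture model corresponds to using one `prior_de` for every gene; the stratified version lets the same observed statistic lead to different conclusions depending on the gene's annotation.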
Feb 3, 2006: Yang Xie
A Bayesian approach to joint modeling of DNA-protein binding data, gene expression data and DNA sequence data
Accurate identification of the genes whose transcription is controlled by a specific regulator is a crucial step towards understanding gene regulation on a genome-wide scale and deciphering the principles of regulatory networks. Exploration of the structure and function of such networks is regarded as a fundamental problem for the coming decades. The genome-wide maps of transcriptional regulators (DNA-protein binding data), DNA sequence data and expression data obtained using whole-genome DNA microarrays represent complementary means to deciphering global and local transcriptional regulatory circuits. Combining these different types of data can not only improve the statistical power, but also provide a more comprehensive picture of gene regulation. In my work, I propose a joint model to combine DNA-protein binding data, gene expression data and DNA sequence data. I specify hierarchical Bayes models and use Markov chain Monte Carlo simulation to draw the inferences. Both the simulation studies and analysis of experimental data show that the proposed joint modeling method can significantly improve the specificity and sensitivity of identification of target genes as compared to conventional approaches relying on a single data source.
Jan 27, 2006: Guanghua Xiao
Hierarchical Bayesian model for gene expression and function annotation data.
A key question in cDNA microarray experiments is which genes are differentially expressed between two experimental conditions. Most statistical methods assume that genes' expression levels are independent of each other. However, many studies have shown that a gene's expression level tends to be similar to that of other genes within the same functional group, and genes in the same functional group are likely to be differentially expressed together. We propose a hierarchical Bayesian model that jointly models gene expression data and gene function annotation data. Both simulation studies and an analysis of experimental data show that the proposed method outperforms the SAM statistic, which is widely used in detecting differentially expressed genes.
Jan 20, 2006: Cavan Reilly
Statistical Issues understanding mRNA decay
Understanding the mechanisms contributing to the regulation of gene expression is a fundamental goal of molecular biology. Such an understanding could form the basis for treating many disorders, such as cancer. Many mechanisms play a role in the regulation of gene expression. Here we discuss several statistical problems that stem from experiments aimed at understanding the role of mRNA decay in the regulation of gene expression. These problems range from the most basic, low-level analysis tasks to elucidating genetic networks involving many genes. In this talk we will focus on posing these questions and discuss some preliminary findings that reflect the complexity of the genomics of mRNA regulation.


