Sequence Pattern Discovery under a Stochastic Dictionary Framework

Mayetri Gupta
Department of Statistics
Harvard University

Monday, March 3, 2003
3:30 PM
Moos 2-620
Minneapolis Campus

Abstract:
Transcription factor binding sites (motifs) are short conserved patterns of 6-20 nucleotides in DNA sequences, and identifying them accurately is the first step in acquiring a detailed understanding of gene regulation. Motif discovery is complicated by several factors: the patterns are of unknown length and may have insertions or deletions, the total number of patterns present is unknown and varies widely, and the sequences contain noise in the form of low-complexity repeats that have no biological significance. We develop a novel framework for sequence pattern identification, extending the idea of a dictionary model (Bussemaker et al., 2000). In our initial model, patterns and single letters are assumed to be independently generated from an unknown dictionary of stochastic words and concatenated to form the sequence. Data augmentation algorithms are developed that iteratively update the pattern composition and locate pattern sites on the sequences. The methodology is extended to find patterns of unknown widths, and patterns with insertions and deletions. In eukaryotic genomes, motif detection is more challenging because the patterns tend to be shorter and less well conserved, and to occur in multi-pattern clusters (regulatory modules). We extend the stochastic dictionary model to a module framework under which an evolutionary Monte Carlo-based state-space model selection algorithm is constructed. This enables us to identify an optimal set of patterns and obtain improved parameter estimates. The performance of these new methods will be demonstrated by both simulation studies and applications to bacterial and human genomes.
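
As a rough illustration of the generative idea in the initial model, the sketch below draws words independently from a small dictionary and concatenates them into a sequence. It is a minimal sketch under stated assumptions, not the speaker's implementation: the toy dictionary, the usage probabilities, the representation of a stochastic word as a list of per-position letter probabilities, and the function names sample_word and generate_sequence are all hypothetical choices made for this example. It covers only sequence generation, not the data augmentation algorithms used for inference.

```python
import random

ALPHABET = "ACGT"

def sample_word(columns, rng):
    """Draw one instance of a stochastic word.

    columns: list of per-position probability vectors over A, C, G, T.
    Each position is sampled independently from its own vector.
    """
    return "".join(rng.choices(ALPHABET, weights=col, k=1)[0] for col in columns)

def generate_sequence(dictionary, usage, n_words, rng):
    """Concatenate n_words independent draws from the dictionary.

    dictionary: word name -> list of probability columns (hypothetical layout).
    usage: word name -> probability of choosing that word.
    Returns the sequence and the (start position, word name) of every draw.
    """
    names = list(dictionary)
    weights = [usage[name] for name in names]
    pieces, sites, pos = [], [], 0
    for _ in range(n_words):
        name = rng.choices(names, weights=weights, k=1)[0]
        word = sample_word(dictionary[name], rng)
        pieces.append(word)
        sites.append((pos, name))
        pos += len(word)
    return "".join(pieces), sites

# Toy dictionary (assumed values): four single letters plus one
# width-6 stochastic word standing in for a motif.
dictionary = {
    "A": [[1, 0, 0, 0]],
    "C": [[0, 1, 0, 0]],
    "G": [[0, 0, 1, 0]],
    "T": [[0, 0, 0, 1]],
    "motif": [
        [0.90, 0.04, 0.03, 0.03],
        [0.03, 0.03, 0.04, 0.90],
        [0.05, 0.05, 0.85, 0.05],
        [0.85, 0.05, 0.05, 0.05],
        [0.04, 0.90, 0.03, 0.03],
        [0.03, 0.03, 0.04, 0.90],
    ],
}
usage = {"A": 0.24, "C": 0.24, "G": 0.24, "T": 0.24, "motif": 0.04}

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same draw
    seq, sites = generate_sequence(dictionary, usage, n_words=200, rng=rng)
    motif_starts = [start for start, name in sites if name == "motif"]
    print(seq)
    print("motif sites:", motif_starts)
```

In this toy setting the motif sites are known because they were simulated. In the problem described above, the sequence is observed while the dictionary and the site locations are unknown and must be inferred.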