Mayetri Gupta
Department of Statistics
Harvard University
Monday, March 3, 2003
3:30 PM
Moos 2-620
Minneapolis Campus
Abstract:
Accurate identification of transcription factor binding sites (motifs), short
conserved patterns of 6-20 nucleotides, in DNA sequences is the first step in
acquiring a detailed understanding of gene regulation. Motif discovery is complicated
by several factors: patterns are of unknown length and may contain insertions or
deletions, the total number of patterns present is unknown and varies widely,
and the sequences contain noise in the form of low-complexity repeats that have
no biological significance. We develop a novel framework for sequence pattern
identification, extending the idea of a dictionary model (Bussemaker et al.,
2000). In our initial model, patterns and single letters are assumed to be independently
generated from an unknown dictionary of stochastic words and concatenated to
form the sequence. Data augmentation algorithms are developed that iteratively
update the pattern composition and locate pattern sites on the sequences. The
methodology is extended to find patterns of unknown widths, and patterns with
insertions and deletions. In eukaryotic genomes, motif detection is a more challenging
problem as the patterns tend to be shorter, less well-conserved, and occur in
multi-pattern clusters (regulatory modules). We extend the stochastic dictionary
model to a module framework under which an evolutionary Monte Carlo-based state-space
model selection algorithm is constructed. This enables us to identify an optimal
set of patterns and obtain improved parameter estimates. The performance of
these new methods will be demonstrated by both simulation studies and applications
to bacterial and human genomes.
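
As a rough illustration of the generative idea behind the initial model (words drawn independently from a dictionary and concatenated to form the sequence), the following minimal Python sketch may help. It is not the speaker's implementation: the function names, the example motif, and all probabilities are hypothetical values chosen for illustration, and each stochastic word is assumed here to be a list of per-position letter distributions over A, C, G, T.

```python
import random

ALPHABET = "ACGT"

def sample_stochastic_word(word, rng):
    """Draw one instance of a stochastic word.

    `word` is a list of per-position probability vectors over ALPHABET
    (an assumed representation, similar to a position weight matrix).
    """
    return "".join(rng.choices(ALPHABET, weights=col, k=1)[0] for col in word)

def generate_sequence(dictionary, usage_probs, n_words, rng):
    """Concatenate n_words independent draws from the dictionary.

    Returns the sequence and a list of (start, word_index) pairs that
    records where each multi-letter pattern was placed.
    """
    pieces, sites, pos = [], [], 0
    for _ in range(n_words):
        idx = rng.choices(range(len(dictionary)), weights=usage_probs, k=1)[0]
        instance = sample_stochastic_word(dictionary[idx], rng)
        if len(instance) > 1:
            sites.append((pos, idx))
        pieces.append(instance)
        pos += len(instance)
    return "".join(pieces), sites

# Hypothetical dictionary: four single-letter words plus one width-6 pattern.
SINGLE_LETTERS = [
    [[1, 0, 0, 0]],  # "A"
    [[0, 1, 0, 0]],  # "C"
    [[0, 0, 1, 0]],  # "G"
    [[0, 0, 0, 1]],  # "T"
]
MOTIF = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.85],
    [0.05, 0.05, 0.85, 0.05],
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.85, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.85],
]
DICTIONARY = SINGLE_LETTERS + [MOTIF]
USAGE_PROBS = [0.24, 0.24, 0.24, 0.24, 0.04]  # illustrative word frequencies

sequence, sites = generate_sequence(DICTIONARY, USAGE_PROBS, 200, random.Random(0))
print(len(sequence), len(sites))
```

In this sketch the pattern locations in `sites` are known because the data are simulated. In the setting the abstract describes, they are unobserved, and the data augmentation algorithms alternate between locating pattern sites and updating the pattern composition.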