Modal Inference and Its Application to High-Dimensional Clustering
Surajit Ray
Department of Mathematics and Statistics
Boston University
Wednesday, October 31, 2007
3:30pm
MoosT 1-450G
Minneapolis Campus
Abstract:
Multivariate mixtures provide flexible methods for both fitting and partitioning
high-dimensional data. Ray and Lindsay (2005) show that the topography of multivariate
mixtures, in the sense of their key features as a density, can be analyzed rigorously
in lower dimensions by use of a ridgeline manifold that contains all critical
points as well as the ridges of the density. To use this rich feature for data
analysis, we first construct an extension of the EM algorithm that can be used to
find the modes of a mixture density. Even in very high dimensions the computational
complexity of our EM algorithm is extremely low. In addition, the method of
steepest ascent can be used to assign the individual data points to modes, providing
a clustering of data points through their modal association.
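
As an illustration only, an EM-type mode finder of the kind described above can be sketched for a Gaussian mixture. The abstract does not give the update formulas, so everything below is an assumption: the function name modal_em, the restriction to Gaussian components, and the stopping rule are hypothetical choices. Each iteration computes the posterior component probabilities at the current point, then moves to the maximizer of the posterior-weighted sum of component log-densities, which for Gaussians has a closed form.

```python
import numpy as np

def modal_em(x0, weights, means, covs, tol=1e-10, max_iter=1000):
    """Climb from x0 to a local mode of a Gaussian mixture (illustrative sketch)."""
    x = np.asarray(x0, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    precisions = np.array([np.linalg.inv(c) for c in covs])
    # Log normalizing constant of each Gaussian component.
    log_norm = np.array([
        -0.5 * (x.size * np.log(2 * np.pi) + np.linalg.slogdet(c)[1])
        for c in covs
    ])
    for _ in range(max_iter):
        # E-step: posterior probability of each component at the current point.
        diffs = x - means
        quad = np.einsum("kd,kde,ke->k", diffs, precisions, diffs)
        log_p = np.log(weights) + log_norm - 0.5 * quad
        p = np.exp(log_p - log_p.max())
        p /= p.sum()
        # M-step: maximize sum_k p_k log f_k(x); closed form for Gaussians.
        A = np.einsum("k,kde->de", p, precisions)
        b = np.einsum("k,kde,ke->d", p, precisions, means)
        x_new = np.linalg.solve(A, b)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Two equal-weight unit-variance components one unit apart form a unimodal
# mixture, so the ascent ends at the midpoint rather than at either mean.
mode = modal_em([0.9, 0.3], [0.5, 0.5], [[0.0, 0.0], [1.0, 0.0]],
                [np.eye(2), np.eye(2)])
```

Running the same ascent from every data point and grouping points that reach the same mode gives the modal association mentioned above. The per-iteration cost depends on the number of components and the dimension, not on a grid over the space.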
These tools can be used in various ways. For one, we can take a conventional
mixture analysis and cluster together those components whose contribution is
actually unimodal. This cluster could then represent a single true component
with a more complex distribution. We can also turn kernel density estimation
into a clustering tool in which the data points become identified with each other
by their association with a common mode of the density estimator. If in addition
we let the bandwidth parameter go from 0 to infinity, we can construct a hierarchical
clustering of the data points. In addition to providing satisfying clustering
results that lie somewhere between those of clustering algorithms and a formal
mixture analysis, the estimation method raises interesting inferential questions
that sit between these two points of view.
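
The kernel-based clustering described above can be sketched in the same spirit. This is a minimal illustration under stated assumptions, not the speaker's implementation: it assumes a spherical Gaussian kernel with bandwidth h, and the names kde_mode and modal_clusters and the merge tolerance are hypothetical. With this kernel, the EM-type ascent reduces to a weighted average of the data points at each step.

```python
import numpy as np

def kde_mode(x0, data, h, tol=1e-9, max_iter=5000):
    """Ascend from x0 to a mode of a Gaussian kernel density estimate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        sq = np.sum((data - x) ** 2, axis=1)
        # Posterior weight of each kernel at x (shifted for numerical stability).
        w = np.exp(-(sq - sq.min()) / (2.0 * h * h))
        x_new = w @ data / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

def modal_clusters(data, h, merge_tol=1e-3):
    """Label each point by the mode its ascent reaches (illustrative sketch)."""
    data = np.asarray(data, dtype=float)
    modes, labels = [], []
    for point in data:
        m = kde_mode(point, data, h)
        for j, known in enumerate(modes):
            if np.linalg.norm(m - known) < merge_tol:
                labels.append(j)
                break
        else:
            # No previously found mode is close enough: record a new one.
            modes.append(m)
            labels.append(len(modes) - 1)
    return np.array(labels), np.array(modes)

# Two tight groups: a small bandwidth separates them, a large one merges them.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
labels_small, modes_small = modal_clusters(points, h=0.5)
labels_large, modes_large = modal_clusters(points, h=10.0)
```

Repeating this over an increasing sequence of bandwidths gives coarser and coarser partitions. The sketch recomputes each level independently and does not by itself force the partitions to be nested; a hierarchical version would need an extra step, such as starting each level's ascents from the modes found at the previous level.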
Application of modal clustering will be discussed in the context of image segmentation.
Co-authors: Bruce Lindsay and Jia Li, Penn State University