Group for Research in Decision Analysis


Metabolic data learning: Forestogram using spike-and-slab models


In many applications, such as metabolomics, data consist of continuous measurements of several subjects (tissues) over multiple variables (metabolites). The measurements are arranged in a matrix with subjects in rows and variables in columns. Analyzing such data requires grouping subjects and variables to provide a preliminary guide toward data modelling. A common approach is to group subjects and variables separately, constructing one clustering tree on the rows and another on the columns. This simple approach visualizes the grouping through two separate trees, which are difficult to interpret jointly. Another approach is to partition the matrix to obtain a joint clustering, but this method loses the tree visualization that biologists find attractive. We propose a binary tree built directly on the matrix, which yields a collection of three-dimensional trees that we call a forestogram. We propose a hierarchical spike-and-slab model to provide robust clustering in the presence of noise. Furthermore, we suggest an extension of the model that quantifies discriminant rows and columns. We recommend using the log posterior as the similarity measure for comparing groupings and building the forestogram. As a consequence, the biclustering algorithm becomes fully automated. We apply the proposed method to real metabolomic measurements.
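To make the spike-and-slab idea concrete, the following is a minimal sketch of a generic two-component comparison for a single block of measurements. It is not the hierarchical model of the paper. It assumes Gaussian noise with known standard deviation `sigma`, a spike that fixes the block mean at zero, and a slab that places a zero-mean Gaussian prior with standard deviation `tau` on the block mean. The function names, the default hyperparameters, and the prior weight `weight` are all hypothetical choices made for illustration.

```python
import math


def log_marginal_spike(values, sigma):
    """Log marginal likelihood when the block mean is exactly zero (spike)."""
    n = len(values)
    sum_sq = sum(v * v for v in values)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - 0.5 * sum_sq / sigma ** 2


def log_marginal_slab(values, sigma, tau):
    """Log marginal likelihood when the block mean has a N(0, tau^2) prior (slab).

    Integrating out the mean gives a Gaussian with covariance
    sigma^2 * I + tau^2 * 1 1^T, whose determinant and inverse have
    closed forms, so no matrix library is needed.
    """
    n = len(values)
    total = sum(values)
    sum_sq = sum(v * v for v in values)
    log_det = n * math.log(sigma ** 2) + math.log(1 + n * tau ** 2 / sigma ** 2)
    quad = (sum_sq - tau ** 2 * total ** 2 / (sigma ** 2 + n * tau ** 2)) / sigma ** 2
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * log_det - 0.5 * quad


def slab_probability(values, sigma=1.0, tau=2.0, weight=0.5):
    """Posterior probability that the block carries signal (slab) rather than noise."""
    log_slab = math.log(weight) + log_marginal_slab(values, sigma, tau)
    log_spike = math.log(1 - weight) + log_marginal_spike(values, sigma)
    # Normalize in log space to avoid underflow for large blocks.
    top = max(log_slab, log_spike)
    log_norm = top + math.log(math.exp(log_slab - top) + math.exp(log_spike - top))
    return math.exp(log_slab - log_norm)


noise_block = [0.1, -0.2, 0.05, -0.1]   # values scattered around zero
signal_block = [3.1, 2.9, 3.0, 3.2]     # values sharing a nonzero mean
print(slab_probability(noise_block), slab_probability(signal_block))
```

Under these assumptions, a block scattered around zero receives a low slab probability and a block with a shared nonzero mean receives a high one. A log-posterior score of this kind is the sort of quantity that could be compared across candidate groupings.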

17 pages