We are concerned with clustering and characterising gene expression sequences that have been classified according to heterogeneous classification schemes. We adopt a model-based approach that uses a Hidden Markov Model (HMM) whose states are the stages of the underlying process that generates the gene sequences, thus allowing us to handle complex and heterogeneous data. Each cluster is described in terms of an HMM, where we seek to find schema mappings between the states of the original sequences and the states of the HMM.

The general solution that we propose involves several distinct tasks. Firstly, there is a clustering problem where we seek to group similar sequences; for this we use mutual entropy to identify associations between sequence states. Secondly, because we are concerned with clustering heterogeneous sequences, we must determine the mappings between the states of each sequence in a cluster and the states of an underlying hidden process; for this we compute the most probable mapping. Thirdly, using these mappings we employ maximum likelihood techniques to learn the probabilistic description of the hidden Markov process for each cluster. Fourthly, we use these descriptions to characterise the clusters, using dynamic programming to determine the most probable pathway for each cluster. Finally, we derive linguistic labels to describe the clusters in a user-friendly manner. Such an approach provides an intuitive way of describing the underlying shape of the process by explicitly modelling the temporal aspects of the data. Non-time-homogeneous HMMs are used to capture the full temporal semantics.
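The fourth task, finding the most probable pathway by dynamic programming, can be illustrated with a Viterbi-style recursion. The sketch below is not the authors' implementation; it assumes discrete observation symbols, sequences already mapped onto a common set of hidden states, and one transition matrix per time step to reflect the non-time-homogeneous setting. The function name `most_probable_pathway`, its parameters, and the toy numbers are all hypothetical and chosen only for illustration.

```python
import math


def _log(p):
    # Log-probability that tolerates zero entries.
    return math.log(p) if p > 0 else float("-inf")


def most_probable_pathway(initial, transitions, emissions, observations):
    """Viterbi recursion for a non-time-homogeneous HMM (illustrative sketch).

    initial[i]           : P(hidden state i at time 0)
    transitions[t][i][j] : P(state j at time t+1 | state i at time t)
    emissions[i][o]      : P(observed symbol o | hidden state i)
    observations         : list of observed symbol indices
    Returns (path, log_probability) of the best hidden-state sequence.
    """
    n = len(initial)
    score = [_log(initial[i]) + _log(emissions[i][observations[0]])
             for i in range(n)]
    back = []  # back[t-1][j] = best predecessor of state j at time t

    for t in range(1, len(observations)):
        step = transitions[t - 1]  # matrix may differ at every time step
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + _log(step[i][j]))
            new_score.append(score[best_i] + _log(step[best_i][j])
                             + _log(emissions[j][observations[t]]))
            pointers.append(best_i)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    last = max(range(n), key=lambda i: score[i])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Toy example with two hidden stages and two observed symbols (made-up numbers).
initial = [0.9, 0.1]
emissions = [[0.8, 0.2], [0.2, 0.8]]
transitions = [
    [[0.9, 0.1], [0.1, 0.9]],  # transition probabilities from t=0 to t=1
    [[0.2, 0.8], [0.1, 0.9]],  # transition probabilities from t=1 to t=2
]
path, log_prob = most_probable_pathway(initial, transitions, emissions, [0, 0, 1])
print(path, math.exp(log_prob))
```

Working in log space avoids numerical underflow on longer sequences, and indexing the transition matrices by time step is what distinguishes this from the usual time-homogeneous recursion.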
|Journal||Artificial Intelligence Review (Special Issue on Life Science and AI)|
|Publication status||Published - 1 Oct 2003|
Bibliographical note
Other Details
This paper describes a method of clustering and characterising gene expression sequences that have been classified according to heterogeneous classification schemes. We adopt a model-based approach using a Hidden Markov Model whose states are the stages of the underlying process, thus allowing us to handle complex and heterogeneous sequences. The approach was developed as part of the EU-IST MISSION project. It is published in a special issue of the AI Review that includes data mining methods for bioinformatics. The concepts are being used and further developed in the EPSRC and HPSSNI-funded RIGHT project, which is developing model-based clustering algorithms for patient pathway sequences.