Abstract
Since expression data make it possible to view expression levels of thousands of genes at a time, many researchers have tried to build a transcriptional regulatory network based on the expression data. The skeleton of the transcriptional network is a binary matrix B with genes as a row and TFs as a column whose (i,j) element is 1 if the TF j binds gene i. Due to the intrinsic limitation of the expression data, it is important to properly integrate other available information. ChIP-chip location data provide direct evidence for the relation between regulators and regulated genes. Our method aims at finding the binary matrix B by integrating location data and expression data. The binary matrix B is given the likelihood P(B) calculated from the location data. B partitions genes into modules and a background according to the binding pattern. We define the measure of quality for the set of modules and a background by the cross validated likelihood of expression profiles measuring the compactness of modules, P(partition|B). We estimate each genes likelihood from other genes belonging to the same module using kernel density estimate. We find a binary matrix B which maximizes the joint probability P(B, partition) = P(B) · P(partition|B). Since finding B maximizing the joint probability is a NP hard problem, we developed an algorithm finding an approximate solution in a linear time of the number of genes. We applied our algorithm to simulated data and the result showed that if the data are generated according to our assumption, our algorithm performs quite well. After applying our algorithm to the cell cycle data from Spellman et al. (1998) and location data from Lee et al. (2002), we validated the obtained gene modules using GO annotation. We compared it with gene modules obtained from four different methods : using expression data only, location data only, GRAM algorithm and ReMoDiscovery. The modules obtained from our algorithm show the best score with respect to GO annotation.