Most of the genomes that are available today, and that are being sequenced at a rapid pace, belong to organisms that cannot be cultivated. In such cases it is impossible to perform large-scale measurements to understand the functions of genes and the interactions between them. Thus, a major objective in the field is the development of efficient algorithms for deciphering this type of information from the genome alone.
In this study, we describe an algorithm that provides information about the expression levels of genes and the interactions and similarities among genes based only on the genome, without additional measurements. The algorithm is based on the assumption that related genes tend to share sub-sequences, which may reflect common regulatory mechanisms, similar functions of the encoded proteins, etc. One output of the algorithm is a network of genes; we show that highly expressed genes tend to be more central in this network. A second output of the algorithm is a low-dimensional clustering of the genes according to their functionality; we show that this representation enables accurate predictions of gene functions.
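To make the underlying assumption concrete, the following is a minimal sketch, not the algorithm of this study, of how a gene network could be built from shared sub-sequences. It assumes fixed-length k-mers as a stand-in for shared sub-sequences and weighted degree as a simple stand-in for network centrality; the function names, the toy sequences, and the parameters `k` and `min_shared` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_network(genes, k=4, min_shared=1):
    """Link two genes when they share at least min_shared k-mers.

    Assumption: the edge weight is the number of distinct shared k-mers.
    """
    profiles = {name: kmers(seq, k) for name, seq in genes.items()}
    edges = {}
    for a, b in combinations(sorted(genes), 2):
        shared = len(profiles[a] & profiles[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


def weighted_degree(genes, edges):
    """Sum the edge weights at each gene (a simple centrality proxy)."""
    centrality = {name: 0 for name in genes}
    for (a, b), weight in edges.items():
        centrality[a] += weight
        centrality[b] += weight
    return centrality


# Toy sequences, invented for illustration only.
genes = {
    "geneA": "ATGGCGTACG",
    "geneB": "TTGGCGTACC",
    "geneC": "ATGAAATTTC",
}
edges = shared_kmer_network(genes, k=4)
centrality = weighted_degree(genes, edges)
```

In this toy example `geneA` and `geneB` share several 4-mers and become linked, while `geneC` shares none and stays isolated. Under the assumption stated above, genes with higher centrality would be the candidates for higher expression.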