The human genome contains approximately 28 million CpG sites, which are subject to methylation in a cell type-specific manner. These methylation marks reflect processes that occur along the genome and define the cellular identity of each cell. They are often disrupted in cancer and other diseases, reflecting the misregulation of disease-associated genes.
Adjacent CpG sites tend to be methylated similarly. We propose a probabilistic approach to model the segmentation of the genome into highly correlated haplotype blocks, such that the DNA methylation sites within a block share a similar methylation status across multiple cell types.
Using this compact model, we developed an efficient dynamic programming algorithm for segmenting the genome, and applied it to dozens of whole-genome bisulfite sequencing (WGBS) datasets from pure cell types. The algorithm segmented the genome into ~6 million homogeneous blocks of size >300 bp, and identified blocks that are methylated in a cell type-specific manner in health and disease, thus overcoming the noise of individual differentially methylated CpG sites.
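To illustrate the kind of dynamic programming involved, the following is a minimal sketch and not the authors' implementation. It assumes a simple model that this text does not specify: each block has one methylation level per sample, reads are scored with a Bernoulli log-likelihood, and every block pays a fixed penalty. The names `segment_blocks`, `penalty`, and `max_len` are hypothetical.

```python
import numpy as np

def segment_blocks(meth, total, penalty=2.0, max_len=50):
    """Split CpG sites into contiguous blocks by dynamic programming.

    meth, total: (n_sites, n_samples) arrays of methylated and total read counts.
    Returns a list of half-open (start, end) site index pairs.
    Assumed model (illustrative only): one methylation level per block per sample.
    """
    meth = np.asarray(meth, dtype=float)
    total = np.asarray(total, dtype=float)
    n_sites, n_samples = meth.shape

    # Prefix sums give the read counts of any block in O(n_samples).
    cum_meth = np.vstack([np.zeros(n_samples), np.cumsum(meth, axis=0)])
    cum_total = np.vstack([np.zeros(n_samples), np.cumsum(total, axis=0)])

    def block_loglik(i, j):
        m = cum_meth[j] - cum_meth[i]
        t = cum_total[j] - cum_total[i]
        p = (m + 0.5) / (t + 1.0)  # pseudocounts keep p strictly inside (0, 1)
        return float(np.sum(m * np.log(p) + (t - m) * np.log1p(-p)))

    # best[j] is the best score of any segmentation of sites [0, j).
    best = np.full(n_sites + 1, -np.inf)
    best[0] = 0.0
    back = np.zeros(n_sites + 1, dtype=int)
    for j in range(1, n_sites + 1):
        for i in range(max(0, j - max_len), j):
            score = best[i] + block_loglik(i, j) - penalty
            if score > best[j]:
                best[j] = score
                back[j] = i

    # Trace back the optimal block boundaries.
    blocks = []
    j = n_sites
    while j > 0:
        blocks.append((int(back[j]), j))
        j = int(back[j])
    return blocks[::-1]
```

Bounding the block length by `max_len` keeps the running time at O(n_sites × max_len) likelihood evaluations, which is what makes a genome-scale pass feasible in a sketch like this. The penalty controls how readily adjacent sites are merged into one block.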
This vast number of differentially methylated blocks paves the way for more sensitive and robust methylation-based analysis, including early cancer detection, cell-free DNA deconvolution, and other classification challenges.
Finally, we present a probabilistic algorithm to infer the relative contributions of various tissues to circulating cell-free DNA (cfDNA), by estimating the posterior probability that each cfDNA fragment originated from each of the different tissues and cell types.
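The fragment-level inference can be sketched as follows. This is a hedged illustration under assumptions not stated in this text: each tissue has a known methylation level per block (`beta`), CpGs within a fragment are treated as independent given the tissue, and the mixing proportions are fitted with a standard expectation-maximization (EM) loop, which is a technique chosen here for illustration. The function name `deconvolve` and all parameters are hypothetical.

```python
import numpy as np

def deconvolve(n_meth, n_unmeth, block_idx, beta, n_iter=200):
    """Estimate tissue proportions from fragment-level methylation counts.

    n_meth, n_unmeth: (n_fragments,) methylated / unmethylated CpGs per fragment.
    block_idx:        (n_fragments,) index of the block each fragment overlaps.
    beta:             (n_blocks, n_tissues) assumed reference methylation levels.
    Returns (proportions, posteriors), where posteriors[f, t] is the posterior
    probability that fragment f originated from tissue t.
    """
    n_meth = np.asarray(n_meth, dtype=float)
    n_unmeth = np.asarray(n_unmeth, dtype=float)
    block_idx = np.asarray(block_idx, dtype=int)
    # Clip so that no tissue assigns zero probability to any read pattern.
    b = np.clip(np.asarray(beta, dtype=float), 1e-3, 1 - 1e-3)[block_idx]

    # log P(fragment | tissue), assuming independent CpGs within a fragment.
    loglik = n_meth[:, None] * np.log(b) + n_unmeth[:, None] * np.log1p(-b)
    # Rescaling each row leaves the posteriors unchanged and avoids underflow.
    lik = np.exp(loglik - loglik.max(axis=1, keepdims=True))

    n_tissues = b.shape[1]
    proportions = np.full(n_tissues, 1.0 / n_tissues)
    for _ in range(n_iter):
        # E-step: posterior tissue of origin for each fragment.
        post = lik * proportions
        post /= post.sum(axis=1, keepdims=True)
        # M-step: proportions are the average posterior responsibilities.
        proportions = post.mean(axis=0)

    # Posteriors under the final proportions.
    post = lik * proportions
    post /= post.sum(axis=1, keepdims=True)
    return proportions, post
```

Working at the level of whole fragments, rather than averaging methylation per CpG, lets a single read carrying several concordant CpGs contribute strong evidence for one tissue, which is the motivation for a fragment-wise posterior.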