In many biological systems, we can view samples as non-negative linear combinations of unknown components. Extracting these components provides insights into the underlying processes of the system. One recent example is cell-free ChIP-seq (cfChIP-seq) data, which combines chromatin modification programs originating from various populations of cells that differ in type and state.
The decomposition task is typically approached using the Non-negative Matrix Factorization (NMF) model, which yields estimates of mixture weights and constant components. However, in many cases, and specifically in cfChIP-seq data, samples may vary in ways that are not entirely explained by the sample-specific mixture weights, but rather by between-sample variation in the components' signal itself.
Here, we account for this variation by introducing a probabilistic model named varNMF, wherein components are modeled as random variables instead of constant vectors.
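As a rough illustration of the contrast described above, the two models can be sketched as follows. The notation is assumed for this sketch and is not taken from the source: $V_i$ is the observed signal of sample $i$, $W_{ik}$ is the weight of component $k$ in sample $i$, $H_k$ is a constant component, and $H_k^{(i)}$ is a per-sample realization of component $k$. The component distribution $p_k$ and its parameters $\theta_k$ are placeholders; the text here does not specify them.

```latex
% NMF: components are constant vectors shared by all samples
V_i \approx \sum_{k=1}^{K} W_{ik} \, H_k,
\qquad W_{ik} \ge 0, \quad H_k \ge 0

% varNMF (sketch): each sample draws its own realization of each component
V_i \approx \sum_{k=1}^{K} W_{ik} \, H_k^{(i)},
\qquad H_k^{(i)} \sim p_k(\,\cdot\,; \theta_k), \quad W_{ik} \ge 0, \quad H_k^{(i)} \ge 0
```

Under this reading, variation between samples can arise from the weights $W_{ik}$ and also from differences between the realizations $H_k^{(i)}$, which constant components cannot capture.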
With simulations, we show that in the presence of component variation, varNMF generalizes better than NMF. We demonstrate its real-world potential by applying it to cfChIP-seq samples from healthy donors and metastatic colorectal carcinoma (CRC) patients. VarNMF shows a significant advantage in generalization over NMF, suggesting the existence of per-sample variation in cell-type signals. VarNMF infers components associated with cell types expected in this dataset, without any prior knowledge. It also estimates each component’s contribution to each sample. Furthermore, with varNMF we can directly observe component variability between samples, which can indicate, for example, differential disease behavior.
Taken together, these results illustrate the potential of varNMF as a tool for component decomposition in biological contexts.