Characterizing cell types using the Benford law

Separating living cells and identifying their cell type is essential for understanding their function and is relevant to medical applications. With the development of high-throughput sequencing technologies in general, and single-cell sequencing in particular, novel methods for identifying cell-type subpopulations have emerged. The final step in this process is to assign cell-type labels to cell clusters, which is done using known gene expression markers or expression signatures for the cell types of interest. Cluster labelling is time-consuming and highly dependent on the quality of the gene signature used. Here we introduce a novel algorithm for cell-type labelling. Our method uses the Benford law (BL), which states that within a large numerical dataset the probability of a leading digit’s occurrence drops as its value increases, to detect a set of genes whose first-digit distributions accurately distinguish between different cell types and can be used as input for machine-learning-based cell-type labelling. The algorithm does not require statistical testing in the gene selection process and is simple and straightforward, requiring only the Benford adherence of genes for feature selection. Moreover, despite the simplicity of this feature-selection method, its separation accuracy is comparable to that of the differential expression approach. Thus, the BL can be used to obtain biological insights from massive amounts of numerical genomics data, a capability that could be utilized in various biomedical applications, e.g., to resolve samples of unknown primary origin, identify possible sample contamination, and provide insights into the molecular basis of cancer subtypes.
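To make the notion of a first-digit distribution concrete, the following minimal Python sketch computes the leading-digit frequencies of a list of expression values and compares them with the Benford expectation, P(d) = log10(1 + 1/d) for d = 1, ..., 9. The function names and the use of the mean absolute deviation as an adherence score are assumptions made for illustration only; they are not necessarily the measure or the implementation used in this work.

```python
import math
from collections import Counter
from decimal import Decimal


def benford_expected():
    """Benford probabilities for leading digits 1..9: log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(x):
    """First significant digit of a positive number."""
    # Go through the shortest round-trip string so that 0.3 yields 3,
    # not the 2 of its exact binary expansion 0.29999...
    return Decimal(str(float(x))).as_tuple().digits[0]


def first_digit_distribution(values):
    """Observed leading-digit frequencies; non-positive values are skipped."""
    digits = [leading_digit(v) for v in values if v > 0]
    if not digits:
        raise ValueError("no positive values to analyse")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def benford_mad(values):
    """Mean absolute deviation from the Benford expectation.

    Assumption: this score is an illustrative adherence measure,
    not necessarily the one used in the paper.
    """
    observed = first_digit_distribution(values)
    expected = benford_expected()
    return sum(abs(observed[d] - expected[d]) for d in range(1, 10)) / 9


# Hypothetical expression values for a single gene across samples.
gene_expression = [0.0, 1.2, 13.5, 104.0, 2.7, 19.1, 3.3, 1.8, 45.0, 0.9]
print(first_digit_distribution(gene_expression))
print(benford_mad(gene_expression))
```

A lower score indicates closer adherence to the Benford law. How such per-gene scores are turned into a feature set for cell-type labelling is not specified here and is left open in this sketch.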