Over the last decade, various computational approaches have been developed to estimate the relative abundance of different cell types from bulk RNA data in an unbiased manner. However, no comparison has objectively evaluated the performance of these approaches against one another, and no publicly available data sets have been generated to serve as ground truth. Here we tested the performance of deconvolution methods using T cell immunofluorescence quantification of 440 regions from 38 treatment-naive ovarian cancer samples as the ground truth. We benchmarked ESTIMATE for total T cell infiltration and compared CIBERSORT, TIMER, MCP-counter and xCell for cell type-specific deconvolution. We also evaluated immune gene sets defined from the gene expression of sorted immune populations, as well as immune gene sets based on the Immunological Genome Project database, using ssGSEA to calculate normalized enrichment scores for the corresponding cell types. Because the cell type deconvolution methods were developed independently of each other, we generated consensus gene sets (ConsensusTME) from the genes that fall in the intersection across the different tools for each cell type, reasoning that this could improve deconvolution performance. We selected genes that overlapped between the independent methods (intersection) and then removed genes whose expression levels correlated positively with tumor purity, using TCGA ovarian cancer samples as a reference. ConsensusTME gene sets consistently showed higher positive correlations with the ground truth than the individual methods. To further benchmark the methods and ConsensusTME, we employed TCGA ovarian cancer leukocyte methylation scores.
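The consensus step described above (keep genes shared between tools, then drop genes that track tumor purity) could be sketched as follows. This is a minimal illustration, not the ConsensusTME implementation: the function names, the "reported by at least two tools" overlap rule, the "Pearson r > 0" removal rule, and all toy values are assumptions made for the example.

```python
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length lists (inputs must not be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def consensus_gene_sets(signatures, expression, purity, min_tools=2):
    """Build one consensus gene set per cell type.

    signatures: {tool: {cell_type: set of genes}}
    expression: {gene: expression values, one per reference sample}
    purity:     tumor purity values, one per reference sample
    min_tools:  hypothetical overlap rule (gene must appear in this many tools)
    """
    # Count, per cell type, how many tools list each gene.
    counts = {}
    for tool_sets in signatures.values():
        for cell_type, genes in tool_sets.items():
            counts.setdefault(cell_type, Counter()).update(set(genes))

    consensus = {}
    for cell_type, counter in counts.items():
        kept = set()
        for gene, n_tools in counter.items():
            if n_tools < min_tools:
                continue  # not shared between tools
            if gene not in expression:
                continue  # cannot be checked against purity, so skipped here
            if pearson(expression[gene], purity) > 0:
                continue  # assumed rule: drop genes rising with tumor purity
            kept.add(gene)
        consensus[cell_type] = kept
    return consensus


# Toy inputs, invented for illustration only.
signatures = {
    "tool_a": {"T_cells": {"CD3E", "CD2", "EPCAM"}},
    "tool_b": {"T_cells": {"CD3E", "EPCAM", "CD8A"}},
    "tool_c": {"T_cells": {"CD2", "CD3E"}},
}
purity = [0.2, 0.5, 0.8, 0.9]
expression = {
    "CD3E": [9.0, 6.0, 3.0, 2.0],
    "CD2": [8.0, 7.0, 4.0, 1.0],
    "EPCAM": [1.0, 4.0, 7.0, 9.0],
    "CD8A": [5.0, 4.0, 2.0, 1.0],
}
result = consensus_gene_sets(signatures, expression, purity)
print(result)
```

In this toy case EPCAM is shared by two tools but rises with purity, and CD8A appears in only one tool, so neither is kept.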
We compared the different methods in an unbiased manner by fitting a multiple linear regression model for each method, using the leukocyte methylation score as the response variable and the different cellular scores as explanatory variables, followed by unsupervised nested variable selection. ConsensusTME gene signatures provided the highest adjusted R-squared with fewer cell types selected (adjusted R-squared = 0.73, p < 2.2e-16) and gave the simplest and most accurate model to explain leukocyte methylation. In addition, we performed a sensitivity analysis of the leukocyte methylation benchmark, in which ConsensusTME was again the best method, with the same cell types selected to explain leukocyte methylation. Together, these benchmarks show that the ConsensusTME gene sets consistently improve leukocyte cell deconvolution in ovarian cancer samples. Pan-cancer validation of ConsensusTME is currently underway.
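The regression benchmark can be illustrated with a small, dependency-free sketch: fit ordinary least squares of a response on a set of cell-type scores, compute adjusted R-squared as 1 - (1 - R²)(n - 1)/(n - p - 1), and drop variables while that value does not decrease. The greedy backward elimination here is only a stand-in, since the abstract does not specify the selection procedure. All names and toy values are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def adjusted_r2(y, columns):
    """Adjusted R-squared of an OLS fit of y on an intercept plus the given columns."""
    n, p = len(y), len(columns)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = p + 1
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


def backward_select(y, scores):
    """Greedy backward elimination (assumed stand-in): drop a variable whenever
    doing so does not lower adjusted R-squared."""
    selected = list(scores)
    best = adjusted_r2(y, [scores[name] for name in selected])
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for name in selected:
            trial = [s for s in selected if s != name]
            value = adjusted_r2(y, [scores[s] for s in trial])
            if value >= best:
                best, selected, improved = value, trial, True
                break
    return selected, best


# Toy data, invented for illustration: the response depends on "T_cells" only.
scores = {
    "T_cells": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "noise": [-1.0, 1.0, 2.0, -2.0, -1.0, 1.0],
}
leukocyte_score = [3.5, 4.5, 7.0, 9.0, 10.5, 13.5]
selected, best = backward_select(leukocyte_score, scores)
print(selected, round(best, 4))
```

Because adjusted R-squared penalizes each extra explanatory variable, the uninformative column is dropped here and the smaller model scores higher, which mirrors the "fewer cell types, higher adjusted R-squared" criterion used in the comparison.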