Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias

Data normalization is a critical step in RNA-seq analysis, aiming to remove systematic effects from the data so that technical biases have minimal impact on the results. Analyzing numerous RNA-seq datasets, we detected a prevalent sample-specific length effect that leads to a strong association between gene length and fold-change estimates between samples. This stochastic sample-specific effect is not corrected by common normalization methods, including RPKM, TMM, RLE, quantile and upper-quartile normalization. Importantly, we demonstrate that this bias causes recurrent false positive calls by gene-set enrichment analysis (GSEA) methods that do not account for inter-gene correlation due to co-regulation, thereby leading to frequent functional misinterpretation of the data. Gene sets characterized by markedly short genes (e.g., ribosomal protein genes) or long genes (e.g., extracellular matrix (ECM) genes) are particularly prone to false calls. This sample-specific length bias is effectively removed by the conditional quantile normalization (cqn) and EDASeq methods, which allow the integration of gene length as a sample-specific covariate. Consequently, using these normalization methods leads to a substantial reduction in false GSEA results while retaining true ones. Alternatively, applying gene-set tests that account for gene-gene correlations also attenuates false positive results, albeit at the cost of reduced statistical power. We thus recommend inspecting and correcting sample-specific length biases as default steps in RNA-seq analysis pipelines to reduce misinterpretation of transcriptomic data.
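
As an illustration of the recommended inspection step, the following minimal sketch correlates log2 gene length with the log2 fold change between two samples. It is not the procedure used in this study: the function names (`pearson`, `length_bias_correlation`), the pseudocount, the choice of Pearson correlation, and the toy expression values are all assumptions made for the example. A correlation far from zero would suggest that fold-change estimates track gene length and that a length-aware normalization may be worth considering.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def length_bias_correlation(lengths, expr_a, expr_b, pseudocount=0.5):
    """Correlate log2 gene length with log2 fold change (sample B over A).

    Hypothetical helper: `pseudocount` is an assumed guard against
    zero expression values, not a parameter taken from the paper.
    """
    log_len = [math.log2(length) for length in lengths]
    log_fc = [
        math.log2((b + pseudocount) / (a + pseudocount))
        for a, b in zip(expr_a, expr_b)
    ]
    return pearson(log_len, log_fc)


# Toy data (fabricated for illustration only): six genes of increasing
# length, with sample B carrying an artificial length-dependent effect.
lengths = [500, 1000, 2000, 4000, 8000, 16000]
sample_a = [100.0] * len(lengths)
sample_b = [100.0 * (length / 2000) ** 0.2 for length in lengths]

r = length_bias_correlation(lengths, sample_a, sample_b)
print(f"length vs. log fold-change correlation: {r:.3f}")
```

In practice the same check would be run on normalized expression values for every pair of samples being compared, since the effect described here differs from sample to sample.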