Data pre-processing. The datasets analyzed were retrieved from the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) databases. Phenotypic information was manually reviewed, and only samples of primary human oral squamous cell carcinoma from the oral cavity that were stage T1 or T2 and HPV-negative were retained. Normalized gene expression values were processed with the ComBat algorithm to remove batch effects [PMID:16632515].
Classifier development. Our final goal was to develop a classifier that predicts node status in clinical settings using RT-PCR. We therefore normalized the training data as is done for RT-PCR, using the housekeeping gene GAPDH as reference. For classification, we used the “k Top Scoring Pairs” (kTSP) algorithm, which classifies a sample by aggregating votes, each vote being determined by the relative expression ordering of the two genes in a pair from a defined set of gene pairs [PMID:25262153]. To limit overfitting, we restricted statistical learning to pairs that combine a gene promoting metastasis with a gene preventing it. We further required the final classifier to be parsimonious (i.e., no more than 6 disjoint pairs), biologically consistent (i.e., higher expression of pro-metastatic genes in node-positive patients), and robust across platforms (i.e., based on the intersection of the top scoring pairs identified in the RNA-seq and microarray data). Before independent validation via RT-PCR, we locked the decision rule, chosen to maximize specificity and sensitivity: a sample was classified as node negative if three or more pairs voted for node-negative status.
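The pair scoring and the locked decision rule can be illustrated with a minimal Python sketch. It is not the implementation used in this study. The gene names (PRO_A, ANTI_A, ...) and the choice of five pairs are placeholders, counting a tie as a node-negative vote is an assumption, and the signed score is a simplification of the usual absolute top-scoring-pair score, used here so that a positive value corresponds to a biologically consistent pair.

```python
from typing import Dict, Sequence, Tuple

# A pair is (pro-metastatic gene, anti-metastatic gene). Names are placeholders.
Pair = Tuple[str, str]
Sample = Dict[str, float]  # gene -> expression (higher value = more expressed)


def signed_pair_score(pair: Pair, samples: Sequence[Sample],
                      node_positive: Sequence[bool]) -> float:
    """P(pro > anti | node positive) - P(pro > anti | node negative).

    Assumption: the sign is kept so that positive scores mark pairs whose
    pro-metastatic gene is more often the higher one in node-positive samples.
    Both classes must be represented in `samples`.
    """
    pro, anti = pair
    pos = [s[pro] > s[anti] for s, y in zip(samples, node_positive) if y]
    neg = [s[pro] > s[anti] for s, y in zip(samples, node_positive) if not y]
    return sum(pos) / len(pos) - sum(neg) / len(neg)


def classify(sample: Sample, pairs: Sequence[Pair],
             min_negative_votes: int = 3) -> str:
    """Locked rule: node negative if at least `min_negative_votes` pairs vote so.

    A pair votes node negative when the pro-metastatic gene is not higher than
    its anti-metastatic partner (ties count as node negative; an assumption).
    """
    negative_votes = sum(1 for pro, anti in pairs if sample[pro] <= sample[anti])
    return "node negative" if negative_votes >= min_negative_votes else "node positive"


# Hypothetical pairs and expression values, for illustration only.
PAIRS = [("PRO_A", "ANTI_A"), ("PRO_B", "ANTI_B"), ("PRO_C", "ANTI_C"),
         ("PRO_D", "ANTI_D"), ("PRO_E", "ANTI_E")]

example = {"PRO_A": 5.0, "ANTI_A": 3.0, "PRO_B": 2.0, "ANTI_B": 4.0,
           "PRO_C": 1.0, "ANTI_C": 6.0, "PRO_D": 7.0, "ANTI_D": 2.0,
           "PRO_E": 3.0, "ANTI_E": 3.5}

label = classify(example, PAIRS)  # pairs B, C and E vote node negative
```

Because each vote depends only on which of the two genes is higher within the same sample, the vote is unchanged when every gene in that sample is rescaled by the same reference (for example, GAPDH), which is one reason rank-based pair rules transfer well between platforms.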
Classifier validation. We validated the classifier on an independent set of 38 patient samples profiled by RT-PCR and on 32 patient-derived xenografts (PDXs) profiled by RNA sequencing.