T-cells circulate through the body and interact with Major Histocompatibility Complex (MHC) proteins, which are present on cell surfaces and bound to peptides sampled from healthy and pathogenic proteins, forming peptide-MHC (pMHC) complexes. The ability to predict whether a given T-cell receptor (TCR) will bind a specific pMHC is limited for several reasons: (1) a single TCR can interact with multiple pMHC complexes; (2) many TCRs with varying degrees of similarity bind to a single pMHC; (3) TCRs vary in the length of their complementarity-determining regions (CDRs). While previous datasets included only information on the TCR beta chain, recent advances in single-cell sequencing have made it possible to extract paired alpha-beta TCR sequences for antigen-specific T-cells. These technologies will enable the rapid characterization of thousands of TCR sequences from both naive and memory T-cells.
We characterized patterns underlying TCR:pMHC interactions using machine learning models. We used convolutional neural networks to decompose the TCR sequences into k-mers. The networks were trained on sequences of epitope-specific TCRs and sequences of background TCRs. Each TCR was represented by its CDR3 k-mer occurrences and their locations. We used publicly available data for 10 epitope-specific TCR repertoires. We found that the k-mer-based method outperformed current state-of-the-art TCR prediction methods for 7 of the 10 epitopes, with an average area under the receiver operating characteristic curve of 0.89. Importantly, our method allows the extraction of predictive local sequence motifs in the TCR sequence. Such biologically interpretable models may provide a roadmap for building a generalized predictive model of the TCR:pMHC interaction.
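As an illustration of the representation described above, the following minimal sketch lists the k-mers of a CDR3 sequence together with their start positions and counts their occurrences. It is not the authors' implementation: the function names, the choice of k = 3, and the example sequence are assumptions made for illustration only, and the convolutional network itself is not shown.

```python
from collections import Counter


def kmers_with_positions(cdr3, k=3):
    """Return (k-mer, start index) pairs for every length-k window of cdr3.

    Hypothetical helper; k=3 is an illustrative default, not a value
    taken from the text.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields an empty list.
    return [(cdr3[i:i + k], i) for i in range(len(cdr3) - k + 1)]


def kmer_counts(cdr3, k=3):
    """Count how often each k-mer occurs in cdr3, ignoring position."""
    return Counter(kmer for kmer, _ in kmers_with_positions(cdr3, k))


if __name__ == "__main__":
    example = "CASSLG"  # made-up CDR3-like string, for illustration only
    for kmer, position in kmers_with_positions(example):
        print(position, kmer)
    print(dict(kmer_counts(example)))
```

Keeping the start index alongside each k-mer preserves the location information that a plain bag-of-k-mers count would discard.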