Predicting T-cell epitope interactions using convolutional neural networks

Liel Cohen-Lavi¹,², Philip Bradley⁶,⁷, Andrew Fiore-Gartland⁵, Jeremy Chase Crawford⁴, Paul Thomas⁴, Aharon Bar-Hillel¹, Tomer Hertz²,³,⁵

¹Industrial Engineering and Management, Ben Gurion University of the Negev, Israel
²National Institute for Biotechnology in the Negev, Ben Gurion University of the Negev, Israel
³Microbiology and Immunology, Ben Gurion University of the Negev, Israel
⁴Immunology, St. Jude Children’s Research Hospital, USA
⁵Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, USA
⁶Public Health Sciences Division, Fred Hutchinson Cancer Research Center, USA
⁷Institute for Protein Design, University of Washington, USA

T-cells circulate through the body and interact with Major Histocompatibility Complex (MHC) proteins, which are present on cell surfaces and are bound to peptides sampled from healthy and pathogenic proteins, forming peptide-MHC (pMHC) complexes. The ability to predict whether a given T-cell receptor (TCR) will bind a specific pMHC is limited for several reasons: (1) a single TCR can interact with multiple pMHC complexes; (2) many TCRs with varying degrees of similarity bind to a single pMHC; (3) TCRs vary in the length of their complementarity-determining regions (CDRs). While previous datasets included only information on the TCR beta chain, recent advances in single-cell sequencing have allowed the extraction of paired alpha-beta TCR sequences for antigen-specific T-cells. These technologies will enable the rapid characterization of thousands of TCR sequences of both naive and memory T-cells.

We characterized patterns underlying TCR:pMHC interactions using machine learning models. We used convolutional neural networks to decompose the TCR sequences into k-mers. The networks were trained on sequences from epitope-specific TCRs and on sequences of background TCRs. Each TCR was represented by its CDR3 k-mer occurrences and their locations. We used publicly available data for 10 epitope-specific TCR repertoires. We found that the k-mer based method outperformed the current state-of-the-art TCR prediction methods on 7 of 10 epitopes, achieving an area under the receiver operating characteristic curve (AUC) of 0.89 on average. Importantly, our method allows the extraction of predictive local sequence motifs in the TCR sequence. Such biologically interpretable models may provide a roadmap for building a generalized predictive model of the TCR:pMHC interaction.
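
As a rough illustration of the representation described above, the following Python sketch lists the k-mers of a CDR3 sequence together with their start positions and counts their occurrences. It is not the authors' implementation: the function names, the choice of k = 3, and the example sequence are assumptions made purely for illustration.

```python
from collections import Counter


def kmers_with_positions(cdr3, k=3):
    """Return (k-mer, start index) pairs for every window of length k.

    Hypothetical helper; k=3 is an assumed default, not a value from the abstract.
    Sequences shorter than k yield an empty list.
    """
    return [(cdr3[i:i + k], i) for i in range(len(cdr3) - k + 1)]


def kmer_counts(cdr3, k=3):
    """Count how often each k-mer occurs in the sequence, ignoring position."""
    return Counter(kmer for kmer, _ in kmers_with_positions(cdr3, k))


if __name__ == "__main__":
    example_cdr3 = "CASSLGQAYEQYF"  # illustrative amino-acid string only
    for kmer, position in kmers_with_positions(example_cdr3):
        print(position, kmer)
    print(kmer_counts(example_cdr3))
```

A convolutional filter of width k sliding over an encoded sequence scores exactly these windows, which is why learned filters can be read as local sequence motifs tied to a position in the CDR3.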