TCR-Epitope Binding Affinity Prediction Task Overview

Definition: T-cells are an integral part of the adaptive immune system, whose survival, proliferation, activation and function are all governed by the interaction of their T-cell receptor (TCR) with immunogenic peptides (epitopes). A large repertoire of T-cell receptors with different specificity is needed to provide protection against a wide range of pathogens. This new task aims to predict the binding affinity given a pair of TCR sequence and epitope sequence.

Impact: An accurate model can help design TCR receptor for effective immunotherapy. It can also unlock a patients’ TCR repertoire, which reflects their immune history and could inform about past and current infectious diseases, vaccine effectiveness or autoimmune reactions.

Generalization: The models are, at very least, expected to generalize to unseen TCRs. But the main challenge of this dataset is to generalize to samples where both epitope and TCR are unseen.

Product: Immunotherapy.

Pipeline: Activity.

Weber et al.

Dataset Description: The dataset is from Weber et al. who assemble a large and diverse data from the VDJ database and ImmuneCODE project. It uses human TCR-beta chain sequences. Since this dataset is highly imbalanced, the authors exclude epitopes with less than 15 associated TCR sequences and downsample to a limit of 400 TCRs per epitope. The dataset contains amino acid sequences either for the entire TCR or only for the hypervariable CDR3 loop. Epitopes are available as amino acid sequences. Since Weber et al. proposed to represent the peptides as SMILES strings (which reformulates the problem to protein-ligand binding prediction) the SMILES strings of the epitopes are also included. 50% negative samples were generated by shuffling the pairs, i.e. associating TCR sequences with epitopes they have not been shown to bind.

Task Description: Binary classification. Given the epitope (a peptide, either represented as amino acid sequence or as SMILES) and a T-cell receptor (amino acid sequence, either of the full protein complex or only of the hypervariable CDR3 loop), predict whether the epitope binds to the TCR.

Dataset Statistics: 47,182 TCR-Epitope pairs between 192 epitopes and 23,139 TCRs.

Dataset Split: Cold TCR Split Cold TCR/Epitope Split

from tdc.multi_pred import TCREpitopeBinding 
data = TCREpitopeBinding(name = 'weber', path = './data')
split = data.get_split()


[1] Weber, Anna, Jannis Born, and María Rodriguez Martínez. “TITAN: T-cell receptor specificity prediction with bimodal attention networks.” Bioinformatics 37.Supplement_1 (2021): i237-i244.

[2] Bagaev, Dmitry V., et al. “VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium.” Nucleic Acids Research 48.D1 (2020): D1057-D1062.

[3] Dines, Jennifer N., et al. “The immunerace study: A prospective multicohort study of immune response action to covid-19 events with the immunecode™ open access database.” medRxiv (2020).

Dataset License: CC BY 4.0.

Contributed by: Anna Weber and Jannis Born.