Antibody Developability Prediction Task Overview

Definition: Immunogenicity, instability, self-association, high viscosity, polyspecificity, or poor expression can all preclude an antibody from becoming a therapeutic. Early identification of these negative characteristics is essential. The task is to predict an antibody's developability from its amino acid sequence.

Impact: A fast and reliable developability predictor can accelerate antibody development by reducing the number of wet-lab experiments. It can also alert chemists to potential efficacy and safety concerns and signal where modifications are needed. Previous work has devised accurate developability indices based on the 3D structure of the antibody. However, 3D structural information is expensive to acquire. A machine learning model that can predict developability from sequence information alone is therefore highly desirable.

Generalization: The model is expected to generalize to unseen classes of antibodies with diverse structural and functional characteristics.

Product: Antibody.

Pipeline: Efficacy and safety.

TAP

Dataset Description: Immunogenicity, instability, self-association, high viscosity, polyspecificity, or poor expression can all preclude an antibody from becoming a therapeutic. Early identification of these negative characteristics is essential. Akin to the Lipinski guidelines, which measure druglikeness in small molecules, the Therapeutic Antibody Profiler (TAP) highlights antibodies that possess characteristics that are rare or unseen in clinical-stage mAb therapeutics. In this dataset, TDC includes five metrics measuring the developability of an antibody: CDR length, patches of surface hydrophobicity (PSH), patches of positive charge (PPC), patches of negative charge (PNC), and the structural Fv charge symmetry parameter (SFvCSP).

Task Description: Regression. Given the antibody's heavy chain and light chain sequences, predict its developability. The input X is a list of two sequences: the first is the heavy chain and the second is the light chain.
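
As a minimal sketch of handling this input format, the helper below unpacks one X entry into its heavy and light chains. It assumes the entry arrives either as a two-element list or as the string form of such a list (which can happen when data is read from a tabular file). The name split_chains is hypothetical and not part of TDC, and the sequences are toy fragments rather than real dataset entries.

```python
import ast
from typing import Sequence, Tuple, Union


def split_chains(x: Union[str, Sequence[str]]) -> Tuple[str, str]:
    """Return (heavy, light) from one X entry.

    Hypothetical helper, not part of TDC. Accepts a two-element
    list/tuple or the string representation of one.
    """
    if isinstance(x, str):
        # Assumption: a string entry is a Python-literal list such as
        # "['EVQL...', 'DIQM...']".
        x = ast.literal_eval(x)
    if len(x) != 2:
        raise ValueError(f"expected [heavy, light], got {len(x)} items")
    heavy, light = x
    return heavy, light


# Toy fragments for illustration only.
heavy, light = split_chains("['EVQLVES', 'DIQMTQS']")
```

Keeping the unpacking in one place makes it harder to swap the two chains by accident when building features downstream.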

Dataset Statistics: 242 antibodies.

Dataset Split: Random Split
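
To illustrate what a random split does, the sketch below shuffles items with a fixed seed and cuts them into train, validation, and test partitions. The function random_split, the 0.7/0.1/0.2 fractions, and the seed are illustrative assumptions; this is not TDC's implementation, and the loader's actual defaults should be checked in the TDC documentation.

```python
import random
from typing import Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    items: Sequence[T],
    frac: Tuple[float, float, float] = (0.7, 0.1, 0.2),  # assumed fractions
    seed: int = 42,  # assumed seed, fixed for reproducibility
) -> Dict[str, List[T]]:
    """Shuffle items and partition them into train/valid/test lists."""
    if abs(sum(frac) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(frac[0] * len(shuffled))
    n_valid = int(frac[1] * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        # The test partition takes whatever remains after rounding.
        "test": shuffled[n_train + n_valid:],
    }


# Indices standing in for the 242 antibodies in this dataset.
parts = random_split(range(242))
```

Because the split ignores sequence similarity, closely related antibodies can land on both sides of it; that is worth keeping in mind when interpreting test scores.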

Note: TAP contains five developability metrics. To retrieve the labels for a specific metric, pass its name to the data loader through the label_name argument. You can list all available label names by calling:

from tdc.utils import retrieve_label_name_list
label_list = retrieve_label_name_list('TAP')

Then follow the standard TDC data loader procedure with the label name specified:

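from tdc.single_pred import Develop
# Load TAP with the first available metric as the regression label.
data = Develop(name = 'TAP', label_name = label_list[0])
# Returns a dict with 'train', 'valid', and 'test' partitions.
split = data.get_split()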

References:

[1] Raybould, Matthew IJ, et al. “Five computational developability guidelines for therapeutic antibody profiling.” Proceedings of the National Academy of Sciences 116.10 (2019): 4025-4030.

Dataset License: CC BY 4.0.


SAbDab, Chen et al.

Dataset Description: Antibody data from Chen et al., who processed entries from SAbDab. From an initial dataset of 3816 antibodies, they retained 2426 antibodies that satisfy the following criteria: 1. have both sequence (FASTA) and Protein Data Bank (PDB) structure files, 2. contain both a heavy chain and a light chain, and 3. have crystal structures with resolution < 3 Å. The developability index (DI) label is derived from BIOVIA's pipelines.
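
The three retention criteria above can be sketched as a simple filter. The Entry record, its field names, and the toy entries are hypothetical and chosen only for illustration; they do not reflect the actual SAbDab schema or the authors' processing code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    """Hypothetical record for one antibody structure."""
    entry_id: str
    has_fasta: bool
    has_pdb: bool
    heavy: Optional[str]          # heavy chain sequence, if present
    light: Optional[str]          # light chain sequence, if present
    resolution: Optional[float]   # crystal structure resolution in Å


def keep(entry: Entry, max_resolution: float = 3.0) -> bool:
    """Apply the three criteria: files, both chains, resolution < 3 Å."""
    has_files = entry.has_fasta and entry.has_pdb
    has_both_chains = bool(entry.heavy) and bool(entry.light)
    good_resolution = (
        entry.resolution is not None and entry.resolution < max_resolution
    )
    return has_files and has_both_chains and good_resolution


# Toy entries, not real database records.
entries: List[Entry] = [
    Entry("toy1", True, True, "EVQLVES", "DIQMTQS", 2.1),
    Entry("toy2", True, True, "EVQLVES", None, 1.8),       # no light chain
    Entry("toy3", True, True, "EVQLVES", "DIQMTQS", 3.4),  # resolution too low
    Entry("toy4", True, False, "EVQLVES", "DIQMTQS", 2.0), # no PDB file
]
kept = [e.entry_id for e in entries if keep(e)]
```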

Task Description: Binary classification. Given the antibody's heavy chain and light chain sequences, predict its developability. The input X is a list of two sequences: the first is the heavy chain and the second is the light chain.
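
One simple way to turn the two variable-length sequences into a fixed-length input for a classifier is amino acid composition. The sketch below is an illustrative baseline featurization, not the method of Chen et al. or a TDC utility; the function names and toy sequences are assumptions.

```python
from typing import List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(seq: str) -> List[float]:
    """Fraction of each standard amino acid in seq (20 values)."""
    seq = seq.upper()
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    # Non-standard characters are ignored in the counts but still
    # contribute to the sequence length.
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def featurize(pair: Sequence[str]) -> List[float]:
    """Concatenate heavy and light chain compositions (40 values)."""
    heavy, light = pair
    return composition(heavy) + composition(light)


# Toy fragments for illustration only.
features = featurize(["EVQLVES", "DIQMTQS"])
```

Composition discards residue order, so it is best treated as a baseline to compare against sequence-aware models.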

Dataset Statistics: 2,409 antibodies.

Dataset Split: Random Split

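from tdc.single_pred import Develop
# This dataset has a single label, so no label_name is needed.
data = Develop(name = 'SAbDab_Chen')
# Returns a dict with 'train', 'valid', and 'test' partitions.
split = data.get_split()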

Note: Since version 0.3.7, TDC supports the graphein protein representation for this task. To obtain the protein representation, see the corresponding TDC tutorial.

References:

[1] Chen, Xingyao, et al. “Predicting antibody developability from sequence using machine learning.” bioRxiv (2020).

[2] Dunbar, James, et al. “SAbDab: the structural antibody database.” Nucleic acids research 42.D1 (2014): D1140-D1146.

[3] Biovia, Dassault Systèmes. “BIOVIA pipeline pilot.” Dassault Systèmes: San Diego, BW, Release (2017).

Dataset License: CC BY 3.0.