ADMET Benchmark Group

ADMET (absorption, distribution, metabolism, excretion, and toxicity) is a cornerstone of small-molecule drug discovery because it defines a drug's efficacy and toxicity profile. An ML model that accurately predicts all ADMET properties from the structural information of compounds would be highly valuable.

We formulate the ADMET Benchmark Group using 22 ADMET datasets in TDC. The ADMET Group contains the following datasets:

from tdc import utils
names = utils.retrieve_benchmark_names('ADMET_Group')
# ['caco2_wang', 'hia_hou', ....]

Type the following to access any benchmark in the group, for example, Caco2_Wang:

from tdc.benchmark_group import admet_group
group = admet_group(path = 'data/')
benchmark = group.get('Caco2_Wang')

predictions = {}
name = benchmark['name']
train_val, test = benchmark['train_val'], benchmark['test']

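## --- train your model on train_val, then predict on test --- ##
## y_pred: your model's predictions for the rows of test, in the same order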

predictions[name] = y_pred
group.evaluate(predictions)
# {'caco2_wang': {'mae': 0.234}}

See the instructions on how to use the BenchmarkGroup class to obtain training, validation, and test sets, and on how to submit your model to the leaderboard.

For every dataset in the benchmark group, we use a scaffold split to partition the data into training, validation, and test sets, holding out 20% of the samples for the test set. The performance metrics are:

For binary classification:

AUROC is used when the numbers of positive and negative samples are similar.
AUPRC is used when the number of positive samples is much smaller than the number of negative samples.

For regression:

MAE is used for the majority of benchmarks.
Spearman's correlation coefficient is used for benchmarks that depend on factors beyond the chemical structure.
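
As a rough illustration of what the four metrics above measure, here is a minimal pure-Python sketch. It is not the evaluation code behind group.evaluate; the function names are made up for this example, AUPRC is approximated by average precision (one common estimator), and the AUPRC helper assumes no tied scores.

```python
def mae(y_true, y_pred):
    """Mean absolute error; lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(y_true, y_pred):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    mt, mp = sum(rt) / len(rt), sum(rp) / len(rp)
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    var_t = sum((a - mt) ** 2 for a in rt)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / (var_t * var_p) ** 0.5


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auprc(labels, scores):
    """Average precision: mean precision at each positive, by descending score.

    Assumes no tied scores (ties are not grouped in this sketch).
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits
```

Spearman's coefficient only compares rankings, so it rewards a model that orders compounds correctly even when the absolute values are off, which is why it suits endpoints influenced by factors beyond the chemical structure.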

We encourage submissions that report results for the entire benchmark group. Still, we welcome and accept submissions that report partial results, for example, submissions with results for just one of the five ADMET categories.
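
To see which categories a partial set of results fully covers, a small helper along these lines may be handy. This is only a sketch: the dataset labels are copied from the Benchmark Data Summary tables further down (they are the tables' short names, not necessarily the keys the TDC API expects), and covered_categories and missing_datasets are hypothetical names, not part of TDC.

```python
# Dataset labels as they appear in the summary tables (an assumption: these
# are display names, not necessarily the benchmark keys used by the API).
ADMET_CATEGORIES = {
    "Absorption": ["Caco2", "HIA", "Pgp", "Bioav", "Lipo", "AqSol"],
    "Distribution": ["BBB", "PPBR", "VDss"],
    "Metabolism": [
        "CYP2C9 Inhibition", "CYP2D6 Inhibition", "CYP3A4 Inhibition",
        "CYP2C9 Substrate", "CYP2D6 Substrate", "CYP3A4 Substrate",
    ],
    "Excretion": ["Half Life", "CL-Hepa", "CL-Micro"],
    "Toxicity": ["LD50", "hERG", "Ames", "DILI"],
}


def covered_categories(reported):
    """Return the categories whose datasets all appear in `reported`."""
    reported = set(reported)
    return [
        category
        for category, names in ADMET_CATEGORIES.items()
        if set(names) <= reported
    ]


def missing_datasets(reported):
    """Return, per category, the datasets that have no reported result."""
    reported = set(reported)
    gaps = {}
    for category, names in ADMET_CATEGORIES.items():
        absent = [name for name in names if name not in reported]
        if absent:
            gaps[category] = absent
    return gaps
```

For example, results for BBB, PPBR, and VDss alone would cover the Distribution category in full and leave the other four categories listed as gaps.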

Benchmark Data Summary

Absorption

Absorption measures how a drug travels from the site of administration to the site of action.

Dataset	Unit	Size	Task	Metric	Dataset Split
Caco2	cm/s	906	Regression	MAE	Scaffold
HIA	%	578	Binary	AUROC	Scaffold
Pgp	%	1,212	Binary	AUROC	Scaffold
Bioav	%	640	Binary	AUROC	Scaffold
Lipo	log-ratio	4,200	Regression	MAE	Scaffold
AqSol	log mol/L	9,982	Regression	MAE	Scaffold

Distribution

Drug distribution refers to how a drug moves to and from the various tissues of the body and the amount of drug in those tissues.

Dataset	Unit	Size	Task	Metric	Dataset Split
BBB	%	1,975	Binary	AUROC	Scaffold
PPBR	%	1,797	Regression	MAE	Scaffold
VDss	L/kg	1,130	Regression	Spearman	Scaffold

Metabolism

Drug metabolism measures how specialized enzymatic systems break down drugs; it determines the duration and intensity of a drug's action.

Dataset	Unit	Size	Task	Metric	Dataset Split
CYP2C9 Inhibition	%	12,092	Binary	AUPRC	Scaffold
CYP2D6 Inhibition	%	13,130	Binary	AUPRC	Scaffold
CYP3A4 Inhibition	%	12,328	Binary	AUPRC	Scaffold
CYP2C9 Substrate	%	666	Binary	AUPRC	Scaffold
CYP2D6 Substrate	%	664	Binary	AUPRC	Scaffold
CYP3A4 Substrate	%	667	Binary	AUROC	Scaffold

Excretion

Drug excretion is the removal of drugs from the body through various routes, including urine, bile, sweat, saliva, tears, milk, and stool.

Dataset	Unit	Size	Task	Metric	Dataset Split
Half Life	hr	667	Regression	Spearman	Scaffold
CL-Hepa	µL·min⁻¹·(10⁶ cells)⁻¹	1,020	Regression	Spearman	Scaffold
CL-Micro	mL·min⁻¹·g⁻¹	1,102	Regression	Spearman	Scaffold

Toxicity

Toxicity measures how much damage a drug could cause to organisms.

Dataset	Unit	Size	Task	Metric	Dataset Split
LD50	log(1/(mol/kg))	7,385	Regression	MAE	Scaffold
hERG	%	648	Binary	AUROC	Scaffold
Ames	%	7,255	Binary	AUROC	Scaffold
DILI	%	475	Binary	AUROC	Scaffold
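
Every table above lists Scaffold as the dataset split. The idea is that molecules sharing a core scaffold are kept in the same partition, so the test set contains structures unlike those seen in training. The sketch below illustrates that idea only; it is not TDC's implementation. It assumes each molecule already has a scaffold key (for instance a Bemis-Murcko scaffold string computed with a cheminformatics toolkit), and scaffold_split is a hypothetical name. The 20% test fraction is the one stated earlier.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_frac=0.2):
    """Split row indices so that no scaffold spans both partitions.

    scaffolds[i] is a precomputed scaffold key for molecule i (an assumed
    input for this sketch). Returns (train_val_indices, test_indices).
    """
    groups = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)

    # Largest scaffold groups first, so rarer scaffolds tend to land in test.
    ordered = sorted(groups.values(), key=len, reverse=True)

    train_val_cutoff = (1 - test_frac) * len(scaffolds)
    train_val, test = [], []
    for members in ordered:
        # A whole group goes to one side; it is never divided.
        if len(train_val) + len(members) <= train_val_cutoff:
            train_val.extend(members)
        else:
            test.extend(members)
    return train_val, test


# Toy example with made-up scaffold keys for ten molecules.
keys = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train_val_idx, test_idx = scaffold_split(keys)
```

Because whole groups are assigned at once, the realized test fraction can deviate from 20% when scaffold groups are large relative to the dataset.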