ADMET Benchmark Group

ADMET (absorption, distribution, metabolism, excretion, and toxicity) is a cornerstone of small-molecule drug discovery: it defines a drug's efficacy and toxicity profile. A machine learning model that can accurately predict ADMET properties from a drug's structural information is therefore highly valuable. We select 22 ADMET datasets from TDC's collection and formulate them as a benchmark group. The ADMET Group contains the following datasets:

from tdc import utils
names = utils.retrieve_benchmark_names('ADMET_Group')
# ['caco2_wang', 'hia_hou', ....]

To access an individual benchmark, for example Caco2_Wang, type:

from tdc import BenchmarkGroup

# Load the benchmark group; datasets are stored under `path`.
group = BenchmarkGroup(name = 'ADMET_Group', path = 'data/')
benchmark = group.get('Caco2_Wang')

predictions = {}
name = benchmark['name']
train_val, test = benchmark['train_val'], benchmark['test']

## --- train your model on train_val, then predict on test --- ##
# y_pred must hold your model's predictions for `test`.

predictions[name] = y_pred
group.evaluate(predictions)
# Example output: {'caco2_wang': {'mae': 0.234}}

Follow the instructions on how to use the BenchmarkGroup class and the training/validation split, as well as the submission instructions.

For every dataset, we use a scaffold split and hold out 20% of the data as the test set. The evaluation metrics are selected according to the following criteria:

  • For binary classification:
    • AUROC is used when the numbers of positive and negative samples are close.
    • AUPRC is used when the number of positive samples is much smaller than the number of negative samples.
  • For regression:
    • MAE is used for the majority of benchmarks.
    • Spearman's correlation coefficient is used for benchmarks that depend on factors beyond the chemical structure.
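As a rough illustration of the criteria above, the sketch below encodes them in a small helper and shows how MAE is computed. The names `select_metric` and `mean_absolute_error` are hypothetical and not part of TDC, and the `imbalance_threshold` cutoff is an assumption: the criteria say "much smaller" without giving a number.

```python
def select_metric(task, positive_fraction=None, structure_only=True,
                  imbalance_threshold=0.3):
    """Return a metric name following the selection criteria above.

    imbalance_threshold is an assumed cutoff for "much smaller";
    the benchmark description does not state one.
    """
    if task == "binary":
        if positive_fraction is None:
            raise ValueError("positive_fraction is required for binary tasks")
        # Few positives: prefer AUPRC; roughly balanced: AUROC.
        return "auprc" if positive_fraction < imbalance_threshold else "auroc"
    if task == "regression":
        # Spearman when the endpoint depends on factors beyond structure.
        return "mae" if structure_only else "spearman"
    raise ValueError(f"unknown task: {task!r}")


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


print(select_metric("regression"))                     # mae
print(select_metric("binary", positive_fraction=0.1))  # auprc
print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))  # 0.5
```

The metric actually assigned to each benchmark is the one listed in the tables of the data summary; the helper only mirrors the stated reasoning.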

We encourage submissions that report results for the entire benchmark group. Still, we welcome submissions that report partial results, for example for just one of the five ADMET categories.
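To make the scaffold split with a 20% held-out test set more concrete, here is a minimal sketch of the general idea: molecules that share a scaffold are kept together, so no scaffold appears in both the training and the test set. It assumes one scaffold string per molecule has already been computed (for example, Murcko scaffolds from a cheminformatics toolkit). The function name `scaffold_split` is hypothetical, and this is not necessarily how TDC implements its split.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_frac=0.2):
    """Split molecule indices so that each scaffold lands in one set only.

    scaffolds: list of precomputed scaffold strings, one per molecule.
    Returns (train_idx, test_idx); the test size is approximate because
    whole scaffold groups are assigned at once.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first: common scaffolds fill the training set,
    # rare scaffolds end up in the test set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n_total = len(scaffolds)
    max_train = n_total - int(round(test_frac * n_total))

    train_idx, test_idx = [], []
    for group in ordered:
        if len(train_idx) + len(group) <= max_train:
            train_idx.extend(group)
        else:
            test_idx.extend(group)
    return train_idx, test_idx


# Toy example with placeholder scaffold labels instead of real SMILES.
example_scaffolds = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train_idx, test_idx = scaffold_split(example_scaffolds)
print(train_idx)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test_idx)   # [8, 9]
```

Because the test set is made of scaffolds unseen during training, this kind of split tends to be harder than a random split and gives a more realistic estimate of performance on new chemical series.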


Benchmark Data Summary

Absorption

Absorption measures how a drug travels from the site of administration to the site of action.

Dataset  Unit       Number  Task        Metric  Split
Caco2    cm/s       906     Regression  MAE     Scaffold
HIA      %          578     Binary      AUROC   Scaffold
Pgp      %          1,212   Binary      AUROC   Scaffold
Bioav    %          640     Binary      AUROC   Scaffold
Lipo     log-ratio  4,200   Regression  MAE     Scaffold
AqSol    log mol/L  9,982   Regression  MAE     Scaffold

Distribution

Drug distribution refers to how a drug moves to and from the various tissues of the body and the amount of drug in those tissues.

Dataset  Unit  Number  Task        Metric    Split
BBB      %     1,975   Binary      AUROC     Scaffold
PPBR     %     1,797   Regression  MAE       Scaffold
VDss     L/kg  1,130   Regression  Spearman  Scaffold

Metabolism

Drug metabolism measures how specialized enzymatic systems break down drugs; it determines the duration and intensity of a drug's action.

Dataset            Unit  Number  Task    Metric  Split
CYP2C9 Inhibition  %     12,092  Binary  AUPRC   Scaffold
CYP2D6 Inhibition  %     13,130  Binary  AUPRC   Scaffold
CYP3A4 Inhibition  %     12,328  Binary  AUPRC   Scaffold
CYP2C9 Substrate   %     666     Binary  AUPRC   Scaffold
CYP2D6 Substrate   %     664     Binary  AUPRC   Scaffold
CYP3A4 Substrate   %     667     Binary  AUROC   Scaffold

Excretion

Drug excretion is the removal of drugs from the body through various routes, including urine, bile, sweat, saliva, tears, milk, and stool.

Dataset    Unit                         Number  Task        Metric    Split
Half Life  hr                           667     Regression  Spearman  Scaffold
CL-Hepa    uL min^-1 (10^6 cells)^-1    1,020   Regression  Spearman  Scaffold
CL-Micro   mL min^-1 g^-1               1,102   Regression  Spearman  Scaffold

Toxicity

Toxicity measures how much damage a drug could cause to organisms.

Dataset  Unit                 Number  Task        Metric  Split
LD50     log(1/(mol/kg))      7,385   Regression  MAE     Scaffold
hERG     %                    648     Binary      AUROC   Scaffold
Ames     %                    7,255   Binary      AUROC   Scaffold
DILI     %                    475     Binary      AUROC   Scaffold