Docking Molecule Generation Benchmark Group


AI-assisted molecule generation aims to generate novel molecular structures with desired properties. Current techniques for evaluating the quality of generated molecules focus on heuristic oracles, such as QED, LogP, and DRD2, and do not reflect the complexity of the real-world environment. This creates several key challenges. Many properties, such as binding propensity towards a target protein, are highly resource-intensive to investigate through experiments or computational simulations. For this reason, techniques that require a large number of oracle calls are not a practical strategy for evaluating generated molecules. Moreover, even when generated molecules score highly on some oracles, they may lack other properties needed in a promising therapeutic candidate (e.g., they may be hard to synthesize).

To address these challenges, we designed a docking benchmark group [Cieplinski et al. 2020; Steinmann and Jensen 2021]. Docking is a theoretical evaluation of the affinity between a ligand (a small-molecule drug) and a target (a protein involved in the disease). Because a molecule with higher affinity is more likely to have higher bioactivity, docking is widely used for virtual screening of compounds [Lyu et al. 2020].

This benchmark evaluates generated molecules by their affinity to the target protein, quantified through docking scores. To this end, the benchmark is structured as follows:

  • As docking scores are relatively costly to calculate, we restrict the number of oracle calls in this benchmark, which requires models to adapt quickly. This setup simulates a real-world environment in which only a limited number of wet-lab experiments can be carried out.
  • In addition to typical oracle scores, we provide additional tests to evaluate how realistic the generated molecules are.
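
The call restriction described above can be pictured with a small wrapper around a scoring function. This is only an illustrative sketch, not TDC's implementation: the names BudgetedOracle, OracleBudgetExceeded, and toy_score are hypothetical, the toy score is not a docking score, and the sketch assumes every call counts towards the budget, including repeated SMILES.

```python
class OracleBudgetExceeded(RuntimeError):
    """Raised when more oracle calls are requested than the budget allows."""


class BudgetedOracle:
    """Hypothetical wrapper that counts calls to a scoring function."""

    def __init__(self, score_fn, max_calls):
        self.score_fn = score_fn    # any callable: SMILES string -> float
        self.max_calls = max_calls  # e.g. 100, 500, 1000 or 5000
        self.num_calls = 0
        self.history = {}           # SMILES -> score for every scored molecule

    def __call__(self, smiles):
        # Assumption: every call consumes budget, even for a repeated SMILES.
        if self.num_calls >= self.max_calls:
            raise OracleBudgetExceeded(
                f"budget of {self.max_calls} oracle calls is used up")
        self.num_calls += 1
        score = self.score_fn(smiles)
        self.history[smiles] = score
        return score


def toy_score(smiles):
    """Toy stand-in for a docking oracle (NOT a real docking score)."""
    return -0.1 * len(smiles)


oracle = BudgetedOracle(toy_score, max_calls=3)
for smi in ["CCO", "CCN", "c1ccccc1"]:
    oracle(smi)
```

A fourth call to oracle in this example raises OracleBudgetExceeded, which forces a model to decide carefully which molecules are worth scoring.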

The training dataset originates from ZINC 250K.

Accessing the Dataset

To retrieve the names of the benchmarks that constitute this benchmark group, type the following:

from tdc import utils
names = utils.retrieve_benchmark_names('Docking_Group')
# ['DRD3', 'XXX', ...]

To access a benchmark, use the following code:

from tdc import BenchmarkGroup
group = BenchmarkGroup(name = 'Docking_Group', 
                path = 'data/', 
                pyscreener_path = 'PATH_TO_PyScreener')

benchmark = group.get('DRD3', num_max_call = 5000) 
# specify the number of maximum calls your model plans to use 

predictions = {}
oracle_fct, data, name = benchmark['oracle'], benchmark['data'], benchmark['name'] 

# --------------------------------------------- # 
#  Train your model using oracle_fct and data   #
#    Save SMILES generation in pred_smiles      #
# --------------------------------------------- #

predictions[name] = pred_smiles 
The format of pred_smiles is a dictionary keyed by the maximum number of oracle calls; each value maps the top 100 generated SMILES to their docking scores:
{5000: {
  'C=C=C=C(C#CON(N=O)C(=NO)C(O)=NC#CC)C(N=CC)(ON=NO)C(=CNN=NNCC)OO': -6.0,
  '(O)OC#CNOC(N)C(=O)NOC(CN)C1=C=C=C=C1': -9.8
 ### if you also evaluate on 1000/500/100 maximum calls
 1000: {

Note: if your model does not produce docking scores for the final 
100 SMILES, you can instead provide a list of the 100 generated SMILES and 
TDC will compute the docking scores in the evaluate function.
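
If your model keeps a record of every scored molecule, the top-100 dictionary can be assembled with a few lines of standard Python. The sketch below is illustrative only: top_k_predictions and the example scores are hypothetical, and it assumes that lower docking scores are better, as in the metrics reported for this benchmark.

```python
import heapq


def top_k_predictions(scored, k=100):
    """Return the k SMILES with the lowest (best) docking scores as a dict."""
    best = heapq.nsmallest(k, scored.items(), key=lambda item: item[1])
    return dict(best)


# Made-up scores for illustration; real values would come from the oracle.
scored = {"CCO": -4.1, "CCN": -5.3, "CCC": -3.2, "c1ccccc1": -6.0}

# k=2 keeps the example small; the benchmark expects the top 100.
pred_smiles = {5000: top_k_predictions(scored, k=2)}
```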

out = group.evaluate(predictions, m1_api = 'XXXXX')
{'DRD3': {5000: {'top100': -10.2,
          'top10': -11.3,
          'top1': -12.3,
          'diversity': 0.6,
          'novelty': 0.7,
          '%pass': 0.7,
          'top1_%pass': -11.2,
          'm1': 2.5,
          'top smiles': ['XXX', 'XXX', ...]
        1000: {....

Note that if you set save_dict = True in the evaluate function, it 
returns more detailed evaluation outcomes, namely the list of 
SMILES that pass the filters and a dictionary of SMILES with their m1 scores and docking scores.

We ask users to submit at least three random runs of their model for robustness. 
You can use the following function to obtain a submission-ready format:
predictions_runs = [pred_run1, pred_run2, pred_run3]
out = group.evaluate_many(predictions_runs, save_file_name = 'result', m1_api = 'XXXXX')
{'DRD3': {5000: {'top100': [-10.2, 0.12],
          'top10': [-11.3, 0.01],
          'top1': [-12.3, 0.02],
          'diversity': [0.6, 0.001],
          'novelty': [0.7, 0.01],
          '%pass': [0.7, 0.02],
          'top1_%pass': [-11.2, 0.03],
          'm1': [5.5, 0.04],
          'top smiles': ['XXX', 'XXX', ...] # superset of runs
        1000: {....

By default, the evaluate_many function calls group.evaluate for each run. If you already have the evaluation result for each run, simply specify it with 'results_individual = XX' to skip the evaluation calls.
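
The [mean, standard deviation] pairs in the output above can be pictured as a simple aggregation over per-run results. This is a sketch under stated assumptions, not the evaluate_many implementation: aggregate_runs is a hypothetical helper, the numbers are made up, and the sketch uses the population standard deviation because the text here does not say which variant TDC reports.

```python
from statistics import mean, pstdev


def aggregate_runs(results, target, budget, metrics):
    """Combine per-run metric values into [mean, standard deviation] pairs."""
    summary = {}
    for metric in metrics:
        values = [run[target][budget][metric] for run in results]
        # Assumption: population standard deviation (divides by n).
        summary[metric] = [mean(values), pstdev(values)]
    return summary


# Three made-up runs in the same nested layout as the evaluate output.
runs = [
    {"DRD3": {5000: {"top1": -12.0, "diversity": 0.5}}},
    {"DRD3": {5000: {"top1": -12.5, "diversity": 0.75}}},
    {"DRD3": {5000: {"top1": -13.0, "diversity": 1.0}}},
]
summary = aggregate_runs(runs, "DRD3", 5000, ["top1", "diversity"])
```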

Performance Evaluation

To evaluate the quality of generated molecules, we report the following metrics (average and standard deviation across 3 or more independent runs):

  • top100: Average docking score of top-100 generated molecules for a given target.
  • top10: Average docking score of top-10 generated molecules for a given target.
  • top1: The lowest docking score of generated molecules.
  • diversity: Average pairwise Tanimoto distance of Morgan fingerprints for top-100 generated molecules.
  • novelty: Fraction of generated molecules that are not present in the training set.
  • m1: Synthesizability score of molecules obtained via a retrosynthesis model.
  • %pass: Fraction of generated molecules that successfully pass through a-priori defined filters. These filters are rules compiled by medicinal chemists and test whether compounds are promising candidates for downstream analyses.
  • top1_%pass (top1-p): The lowest docking score among molecules that are not filtered out.
  • molecules: Visualizations of molecular structure of the superset of top-100 molecules across independent runs of the model.
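
To make the definitions above concrete, here is a minimal pure-Python sketch of several of the metrics. It is not the code that group.evaluate runs. All function names are hypothetical, fingerprints are assumed to be precomputed and given as sets of on-bit indices (for example from a cheminformatics toolkit), and novelty compares SMILES as plain strings, which only works if both sides use the same canonical form.

```python
from itertools import combinations


def top_k_mean(scores, k):
    """Average of the k lowest (best) docking scores."""
    best = sorted(scores)[:k]
    return sum(best) / len(best)


def novelty(generated, training):
    """Fraction of generated SMILES that do not appear in the training set."""
    training = set(training)
    return sum(s not in training for s in generated) / len(generated)


def tanimoto_distance(bits_a, bits_b):
    """1 - |A & B| / |A | B| for two fingerprints given as sets of on-bits."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0
    return 1.0 - len(bits_a & bits_b) / union


def diversity(fingerprints):
    """Average Tanimoto distance over all unordered pairs (needs >= 2 items)."""
    pairs = list(combinations(fingerprints, 2))
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)


def top1_pass(scores_by_smiles, passed):
    """Lowest docking score among molecules that passed the filters."""
    return min(scores_by_smiles[s] for s in passed)


# Made-up data for illustration only.
scores = {"CCO": -9.0, "CCN": -11.0, "CCC": -7.0, "CCCl": -10.0}
fingerprints = [{1, 2, 3}, {2, 3, 4}, {5, 6}]
passed = ["CCO", "CCC", "CCCl"]  # pretend "CCN" was filtered out

top2 = top_k_mean(scores.values(), k=2)
nov = novelty(list(scores), training=["CCO"])
div = diversity(fingerprints)
pass_rate = len(passed) / len(scores)
best_passing = top1_pass(scores, passed)
```

In this toy example the best overall molecule is filtered out, so the top1_%pass value is worse than top1; the difference between the two metrics is meant to expose exactly this situation.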

Note that all evaluations are computed automatically by the group.evaluate function, with the exception of the m1 metric.

To include m1 in the evaluation, specify the m1_api token in the evaluation function. Note that this is a non-commercial service kindly provided to TDC by our partner organization. Please follow the provider's terms of usage if you plan to calculate m1 scores. You can opt out of m1 and submit your results without it by not specifying the m1_api token. Check out this page to get the API token and learn about the terms of usage.

Maximum Number of Calls to Oracles

To simulate the resource-intensive nature of real-world molecule generation, we restrict the maximum number of oracle calls to the order of 10^3, i.e., O(10^3). To evaluate models in increasingly harder learning regimes, we provide four leaderboards, each allowing only a certain number of oracle calls. The smaller the number of allowed oracle calls, the harder the learning task. You can specify the maximum number of oracle calls as follows: group.get('DRD3', num_max_call = 5000). We currently support leaderboards with a maximum of 100 (the toughest learning regime), 500, 1000, or 5000 (the least tough learning regime) calls to the oracles.
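
If you want to report results for several leaderboards, one way to organize the work is to loop over the supported budgets and store one prediction dictionary per budget. The sketch below is a hypothetical helper, not part of TDC: collect_predictions and toy_generate are made-up names, and in real use the generate function would obtain the benchmark via group.get('DRD3', num_max_call = budget) and train your model with that budget.

```python
SUPPORTED_BUDGETS = (100, 500, 1000, 5000)


def collect_predictions(generate_fn, budgets=SUPPORTED_BUDGETS):
    """Run generate_fn once per budget and key the results by that budget."""
    predictions = {}
    for budget in budgets:
        if budget not in SUPPORTED_BUDGETS:
            raise ValueError(f"unsupported number of oracle calls: {budget}")
        # generate_fn must return a dict mapping SMILES to docking score.
        predictions[budget] = generate_fn(budget)
    return predictions


def toy_generate(budget):
    """Toy stand-in for training a model under the given budget."""
    return {"CCO": -0.001 * budget}


pred_smiles = collect_predictions(toy_generate)
```

The resulting dictionary has the same {budget: {SMILES: score}} shape as the pred_smiles example shown earlier on this page.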


This benchmark requires the TDC Oracle class. Computing docking scores additionally requires PyScreener. You can find the detailed installation steps here.

To submit your result, please fill out THIS FORM. The evaluation result file result.pkl will be automatically generated after calling group.evaluate_many(predictions_runs, save_file_name = 'result', m1_api = 'XXXXX').

Leaderboard Data Summary

Dataset     Diseases                 Link to DRD3 target protein
TDC.DRD3    Tremor, Schizophrenia    Uniprot Page