Introduction

Developing therapeutics to cure diseases and improve human health is an overarching goal of biomedical research. With the advent of high-throughput genome sequencing and compound screening and the digitization of health records, large amounts of rich and diverse data have become available for research and analysis. The availability of these large datasets calls for the development of algorithmic and AI infrastructure and AI-ready datasets to facilitate discovery.

TDC is the first unifying framework to systematically access and evaluate ML across the entire range of therapeutics. At its core, TDC is a collection of curated datasets and learning tasks that can translate algorithmic innovation into biomedical and clinical implementation. Datasets in TDC cover a wide range of therapeutic products (e.g., small molecule, biologics, gene editing) across the entire discovery and development pipeline (i.e., target identification, hit discovery, lead optimization, manufacturing).

Overview of TDC

Importantly, datasets in TDC are AI/ML-ready: input features are processed into the most accessible format possible, so ML researchers can use them directly as input to ML algorithms. Further, TDC provides numerous functions for model evaluation, meaningful data splits, data processing, and oracles for molecule generation. All features of TDC are designed to integrate easily into any ML workflow. To this end, we develop an open-source software library that allows for efficient retrieval of any TDC dataset and implements supporting functions for the learning tasks.

We also provide realistic leaderboards where we curate learning tasks into meaningful train, validation, and test splits to provide a systematic model development and evaluation framework and test the extent to which algorithmic advancements transfer to real-world settings.

Facilitating algorithmic and scientific advance in the broad area of therapeutics
We envision TDC to be the meeting point between domain scientists and ML scientists. Domain scientists can pose problems and identify relevant datasets, which are carefully processed, integrated into TDC, and formulated as scientifically valid learning tasks. ML scientists can then rapidly obtain these tasks and ML-ready datasets through the TDC programming framework and use them to design powerful ML methods. Predictions and other outputs produced by ML models can then facilitate algorithmic and scientific advances in therapeutics.

Vision for TDC


Tiered Design of Therapeutics Data Commons: “Problem–Learning Task–Data Set”

TDC has a unique three-tiered hierarchical structure which, to our knowledge, is the first attempt at systematically organizing ML for therapeutics. We organize TDC into three distinct problems. For each problem, we give a collection of learning tasks. Finally, for each task, we provide a series of datasets.

In the first tier, after observing a large set of therapeutics tasks, we categorize and abstract out three major problems:

  • Single-instance prediction single_pred: Prediction of property for an individual biomedical entity.
  • Multi-instance prediction multi_pred: Prediction of property for multiple biomedical entities.
  • Generation generation: Generation of a new desirable biomedical entity.
TDC hierarchy

The second tier in the TDC structure is organized into learning tasks. Improvement on these tasks can result in numerous applications, including identifying personalized combinatorial therapies, designing novel classes of antibodies, improving disease diagnosis, and finding new cures for emerging diseases.

In the third tier of TDC, we provide multiple datasets for each task. For each dataset, we provide several splits of the dataset into training, validation, and test sets to evaluate model performance.
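
For intuition, the splitting logic can be sketched as a deterministic random split. This is a simplified stand-alone illustration, not TDC's implementation: TDC's own split utilities operate on pandas DataFrames and also offer harder splits such as scaffold splits.

```python
import random

def random_split(items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle items deterministically and cut them into train/valid/test.

    Simplified stand-in for a dataset splitter; the fractions and seed
    are illustrative defaults, not TDC's.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed -> reproducible split
    n = len(items)
    n_train = int(frac[0] * n)
    n_valid = int(frac[1] * n)
    return {
        "train": items[:n_train],
        "valid": items[n_train:n_train + n_valid],
        "test": items[n_train + n_valid:],
    }

splits = random_split(range(100))
print({k: len(v) for k, v in splits.items()})  # {'train': 70, 'valid': 10, 'test': 20}
```

Fixing the seed matters: it guarantees that every model compared on a dataset sees exactly the same train/validation/test partition.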

TDC problems

Installation

To install the TDC package, open the terminal and type:

pip install PyTDC

The core data loaders in TDC are lightweight, and installation of the TDC package is hassle-free, with minimal dependencies on external packages.


Data Loaders

TDC provides a collection of workflows with intuitive, high-level APIs for both beginners and experts to create ML models in Python. Building on the modularized "Problem--Learning Task--Data Set" structure (see above), we provide a three-layer API to access any learning task and dataset.

TDC api

As an example, to obtain the Caco-2 dataset from the ADME learning task in the single-instance prediction problem:

from tdc.single_pred import ADME
data = ADME(name = 'Caco2_Wang')
df = data.get_data()
splits = data.get_split()

The variable df is a pandas DataFrame holding the entire dataset. By default, the variable splits is a dictionary with keys train, valid, and test, whose values are pandas DataFrames with drug IDs, SMILES strings, and labels. For detailed information about outputs, see the Datasets documentation.

The user only needs to specify "Problem--Learning Task--Data Set." TDC then automatically retrieves the processed AI/ML-ready dataset from the TDC server and generates a data object that exposes numerous data functions, which can be applied directly to the dataset.
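
As a rough illustration of the shapes involved, a single-instance prediction dataset reduces to rows of an identifier, a SMILES string, and a label. The rows below are a hand-written mock (the values are invented, not fetched from the TDC server); the column names follow the Drug_ID / Drug / Y convention used across TDC's single-instance datasets.

```python
# Hypothetical mock of a few rows of an ADME-style dataset.
# Values are made up for illustration only.
rows = [
    {"Drug_ID": "mol_1", "Drug": "CCO",      "Y": 0.51},
    {"Drug_ID": "mol_2", "Drug": "c1ccccc1", "Y": -0.12},
    {"Drug_ID": "mol_3", "Drug": "CC(=O)O",  "Y": 0.33},
]

# A model consumes the SMILES strings and predicts the labels.
smiles = [r["Drug"] for r in rows]
labels = [r["Y"] for r in rows]
print(smiles)  # ['CCO', 'c1ccccc1', 'CC(=O)O']
```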


Ecosystem of Data Functions, Tools, Libraries, and Community Resources

TDC includes numerous data functions that can be readily used with any TDC dataset. TDC divides its programmatic ecosystem into four broad categories:

  • Model Evaluation: TDC implements a series of metrics and performance functions to debug ML models, evaluate model performance for any task in TDC, and assess whether model predictions generalize to out-of-distribution datasets.
  • Dataset Splits: Therapeutic applications require ML models to generalize to out-of-distribution samples. TDC implements various data splits to reflect realistic learning settings.
  • Data Processing: As therapeutics ML covers a wide range of data modalities and requires numerous repetitive processing functions, TDC implements wrappers and useful data helpers for them.
  • Molecule Generation Oracles: Molecular design tasks require oracle functions to measure the quality of generated entities. TDC implements over 17 molecule generation oracles, representing the most comprehensive collection of molecule oracles to date. Each oracle is tailored to measure the quality of AI-generated molecules along a specific dimension.
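
To give a flavor of the model-evaluation category, a regression metric such as mean absolute error can be written in a few lines. This is a stand-alone sketch; in TDC itself, metrics are exposed through an evaluator utility (e.g. `from tdc import Evaluator`) rather than hand-rolled.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute deviation between true labels and predictions."""
    assert len(y_true) == len(y_pred), "label/prediction length mismatch"
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels and predictions, e.g. for a permeability regression task.
y_true = [0.5, -0.1, 0.3]
y_pred = [0.4, 0.0, 0.3]
print(mean_absolute_error(y_true, y_pred))  # ~0.0667
```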

For further information, see the Data Functions documentation.


Public Leaderboards

TDC provides leaderboards for systematic model evaluation and comparison. Each dataset in TDC can be thought of as a benchmark. For a model to be useful for a specific therapeutic question, it needs to consistently perform well across multiple datasets and tasks. For this reason, we group individual benchmarks in TDC into meaningful groups, which we refer to as benchmark groups. Datasets and tasks within a benchmark group are carefully selected and centered around the therapeutic question. Further, dataset splits and evaluation metrics are also thoughtfully selected to reflect the difficulty of the therapeutic question in a real-world setting.
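
Benchmark results are typically reported as the mean and standard deviation of a metric over several seeded training runs. The aggregation step looks roughly like this (a stand-alone sketch of that reporting convention, with invented scores, not the TDC leaderboard code):

```python
import statistics

def aggregate_runs(scores):
    """Summarize per-seed benchmark scores as mean and sample std."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }

# Hypothetical scores from five seeded training runs on one benchmark.
scores = [0.71, 0.69, 0.72, 0.70, 0.68]
summary = aggregate_runs(scores)
print(f"{summary['mean']:.3f} +/- {summary['std']:.3f}")  # 0.700 +/- 0.016
```

Averaging over seeds keeps a single lucky split or initialization from deciding a leaderboard position.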

For further information, see the Benchmark documentation.


Cite Us

If you use TDC in your work, consider citing our manuscript:

@article{tdc2021,
  title={Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics},
  author={Huang, Kexin and Fu, Tianfan and Gao, Wenhao and Zhao, Yue and Roohani, Yusuf and Leskovec, Jure and Coley, Connor W and Xiao, Cao and Sun, Jimeng and Zitnik, Marinka},
  journal={arXiv preprint arXiv:2102.09548},
  year={2021}
}

If you use any of the datasets, make sure to also cite the primary data source, as described in the Datasets section.


Start Exploring Therapeutics Data Commons