Introduction

Developing therapeutics to cure diseases and improve human health is an overarching goal of biomedical research. However, the lack of high-quality benchmarks impedes the advancement of artificial intelligence for drug discovery. To address this, TDC supports the development of novel ML theory and methods, with a strong focus on establishing which ML algorithms are most suitable for drug discovery applications and why.

TDC is the first unifying resource to systematically access and evaluate ML capability across the entire range of therapeutics. It contains curated AI/ML-ready datasets, ML tasks, benchmarks and leaderboards, and extensive Python programming functionality. Datasets in TDC cover a wide range of therapeutic products (e.g., small molecules, biologics, genome editing therapeutics) across the entire discovery and development pipeline (e.g., target identification, hit discovery, lead optimization, manufacturing).

[Figure: Overview of TDC]

Importantly, datasets in TDC are AI/ML-ready: input features are processed into the most accessible format possible, so that scientists can use them directly as input to ML methods. Further, TDC provides numerous functions for model evaluation, meaningful data splits, data processors, and oracles for molecule generation. All features of TDC are designed to integrate easily into any ML workflow. To this end, we develop an open-source software library, the TDC package, that allows for efficient retrieval of any TDC dataset and implements supporting functions for learning tasks.

TDC also provides public benchmarks. Each benchmark has a carefully designed ML task, AI/ML-ready dataset, a public leaderboard, and performance metrics to support systematic model benchmarking and test how ML methods generalize to settings encountered in real-world drug discovery implementation.

Facilitating algorithmic and scientific advances in the broad area of therapeutics

We envision TDC as the meeting point between domain scientists and ML scientists. Domain scientists can pose AI/ML tasks and identify relevant datasets, which are carefully processed, integrated into TDC, and formulated as scientifically valid AI/ML tasks. ML scientists can then rapidly obtain these tasks and AI/ML-ready datasets from TDC and use them to design powerful ML methods. Predictions and other outputs produced by ML methods can facilitate algorithmic and scientific advances in the broad area of therapeutics.

[Figure: Vision for TDC]


Tiered Design of Therapeutics Data Commons: “Problem – Learning Task – Data Set”

TDC has a unique three-tiered hierarchical structure, which, to our knowledge, is the first attempt at systematically organizing ML for therapeutics. We organize TDC into three distinct problems. For each problem, we give a collection of learning tasks. Finally, for each task, we provide a series of datasets.

In the first tier, after observing a large set of therapeutics tasks, we categorize and abstract out three major problems:

  • Single-instance prediction (single_pred): Prediction of a property of an individual biomedical entity.
  • Multi-instance prediction (multi_pred): Prediction of a property of multiple biomedical entities.
  • Generation (generation): Generation of a new, desirable biomedical entity.

[Figure: TDC hierarchy]

The second tier in the TDC structure is organized into learning tasks. Improvement on these tasks can result in numerous applications, including identifying personalized combinatorial therapies, designing novel classes of antibodies, improving disease diagnosis, and finding new cures for emerging diseases.

In the third tier of TDC, we provide multiple datasets for each task. For each dataset, we provide several splits of the dataset into training, validation, and test sets to evaluate model performance.
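
The three tiers can be pictured as a nested mapping from problem to learning task to dataset. The sketch below is illustrative only: HIERARCHY and loader_statement are hypothetical names that are not part of the TDC package, and only the ADME / Caco2_Wang entry is taken from this page. The other two entries are assumed examples, and the mapping is far from the full catalogue.

```python
# Hypothetical sketch of the "Problem - Learning Task - Data Set" tiers.
# HIERARCHY and loader_statement are illustrative names, not part of PyTDC.
# Only ADME / Caco2_Wang comes from this page; the other entries are assumed.
HIERARCHY = {
    "single_pred": {"ADME": ["Caco2_Wang"]},
    "multi_pred": {"DTI": ["DAVIS"]},
    "generation": {"MolGen": ["ZINC"]},
}


def loader_statement(problem, task, dataset):
    """Return, as text, the statements that would load the given dataset."""
    if dataset not in HIERARCHY.get(problem, {}).get(task, []):
        raise KeyError(f"unknown combination: {problem}/{task}/{dataset}")
    return f"from tdc.{problem} import {task}\ndata = {task}(name='{dataset}')"


print(loader_statement("single_pred", "ADME", "Caco2_Wang"))
```

Each tier narrows the choice made at the previous one, so a dataset is identified by three names: the problem, the learning task within that problem, and the dataset within that task.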

[Figure: TDC problems]

Installation

To install the TDC package, open the terminal and type:

pip install PyTDC

The core data loaders in TDC are lightweight, and the package installs with minimal dependencies on external packages.


Data Loaders

TDC provides a collection of workflows with intuitive, high-level APIs for both beginners and experts to create ML models in Python. Building off the modularized "Problem--Learning Task--Data Set" structure (see above) in TDC, we provide a three-layer API to access any learning task and dataset.

[Figure: TDC API]

As an example, to obtain the Caco2 dataset from the ADME task in the single-instance prediction problem, do as follows:

from tdc.single_pred import ADME

data = ADME(name='Caco2_Wang')  # load the Caco2_Wang dataset from the ADME task
df = data.get_data()            # the entire dataset
splits = data.get_split()       # train/val/test partitions

The variable df is a pandas DataFrame holding the entire dataset. By default, the variable splits is a dictionary with keys train, val, and test, whose values are pandas DataFrames with Drug IDs, SMILES strings, and labels. For detailed information about outputs, see the Datasets documentation.

The user only needs to specify "Problem -- Machine Learning Task -- Data Set." TDC then automatically retrieves the processed AI/ML-ready dataset from the TDC server and generates a data object, exposing numerous data functions that can be directly applied to the dataset.


Ecosystem of Data Functions, Tools, Libraries, and Community Resources

TDC includes numerous data functions that can be readily used with any TDC dataset. TDC divides its programmatic ecosystem into four broad categories:

  • Model Evaluation: TDC implements a series of metrics and performance functions to debug ML models, evaluate model performance for any task in TDC, and assess whether model predictions generalize to out-of-distribution datasets.
  • Dataset Splits: Therapeutic applications require ML models to generalize to out-of-distribution samples. TDC implements various data splits to reflect realistic learning settings.
  • Data Processing: As therapeutics ML covers a wide range of data modalities and requires numerous repetitive processing functions, TDC implements wrappers and useful data helpers for them.
  • Molecule Generation Oracles: Molecular design tasks require oracle functions to measure the quality of generated entities. TDC implements over 17 molecule generation oracles, representing the most comprehensive collection of molecule oracles. Each oracle is tailored to measure the quality of AI-generated molecules in a specific dimension.

For further information, see the Data Functions documentation.
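
To make the idea of an out-of-distribution split concrete, here is a minimal sketch of a group-based split in plain Python. It is not TDC's implementation: group_split is a hypothetical helper, the fractions are illustrative, and the toy grouping key (the first character of each string) merely stands in for a chemically meaningful grouping such as a molecular scaffold. The point is only that every member of a group lands in the same partition, so test samples come from groups never seen during training.

```python
import random
from collections import defaultdict


def group_split(items, key, frac=(0.7, 0.1, 0.2), seed=42):
    """Split items into train/val/test so that no group spans two partitions.

    key maps an item to its group label. frac gives target fractions; they
    are only approximate because whole groups are assigned at once.
    """
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    # Shuffle group labels reproducibly, then fill train, val, test in order.
    labels = sorted(groups)
    random.Random(seed).shuffle(labels)

    n = len(items)
    train_cap = frac[0] * n
    val_cap = (frac[0] + frac[1]) * n
    splits = {"train": [], "val": [], "test": []}
    assigned = 0
    for label in labels:
        if assigned < train_cap:
            target = "train"
        elif assigned < val_cap:
            target = "val"
        else:
            target = "test"
        splits[target].extend(groups[label])
        assigned += len(groups[label])
    return splits


# Toy data: the first character plays the role of the group (hypothetical).
samples = [f"{g}{i}" for g in "ABCDEFGHIJ" for i in range(3)]
splits = group_split(samples, key=lambda s: s[0])
print({name: len(part) for name, part in splits.items()})
```

A purely random split would scatter members of the same group across train and test, which tends to overstate how well a model handles entities unlike those it was trained on.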


Public Leaderboards

TDC provides leaderboards for systematic model evaluation and comparison. Each dataset in TDC can be thought of as a benchmark. For a model to be useful for a specific therapeutic question, it needs to consistently perform well across multiple datasets and tasks. For this reason, we group individual benchmarks in TDC into meaningful groups, which we refer to as benchmark groups. Datasets and tasks within a benchmark group are carefully selected and centered around the therapeutic question. Further, dataset splits and evaluation metrics are also thoughtfully selected to reflect the difficulty of the therapeutic question in a real-world setting.
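
One way to read "consistently perform well across multiple datasets" is to summarize a model's score on every dataset in a group over several independent runs. The sketch below does this with the Python standard library. Everything in it is an assumption made for illustration: summarize_group is a hypothetical helper, the dataset names are placeholders, and the scores are made-up numbers rather than real results.

```python
from statistics import mean, stdev


def summarize_group(scores_by_dataset):
    """Map each dataset name to (mean, sample standard deviation) over runs."""
    summary = {}
    for dataset, scores in scores_by_dataset.items():
        if len(scores) < 2:
            raise ValueError(f"{dataset}: need at least two runs for a deviation")
        summary[dataset] = (mean(scores), stdev(scores))
    return summary


# Placeholder scores from five hypothetical runs per dataset (not real results).
runs = {
    "dataset_a": [0.50, 0.52, 0.48, 0.51, 0.49],
    "dataset_b": [0.80, 0.78, 0.82, 0.79, 0.81],
}

summary = summarize_group(runs)
for dataset, (avg, sd) in summary.items():
    print(f"{dataset}: {avg:.3f} +/- {sd:.3f}")
```

Reporting a spread alongside the mean makes it easier to tell a model that is reliably good on a dataset from one that scored well on a single favorable run.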

For further information, see the Benchmark documentation.


Start Exploring Therapeutics Data Commons