Quick Start Guide

Machine learning applied to the discovery and development of therapeutics is an increasingly popular area of research. To facilitate progress in this field, we introduce the Therapeutics Data Commons (TDC), a large set of biomedically relevant learning tasks spread across different domains of drug discovery and development. We curate tasks into specific training, validation, and test splits to ensure that performance on each task generalizes and transfers to real-world scenarios. TDC will help the machine learning community focus on solving scientifically meaningful problems.

Vision for TDC

All methods and datasets are integrated and accessible via the open-source TDC package. TDC is built upon numerous public data sources, such as curated databases, journal supplements, and bioassays. The full collection currently includes 66 datasets covering a wide range of tasks and biomedical products.


Installation

To install the TDC package, open the terminal and type:

pip install PyTDC

The core data loaders in TDC are lightweight, and the package installs with minimal dependencies on external packages.
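
To confirm that the installation succeeded, the installed version can be looked up from Python. This is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper, and it assumes the distribution is registered under the same name used in the pip command above (PyTDC).

```python
from importlib import metadata


def installed_version(dist_name="PyTDC"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print("PyTDC is not installed" if version is None else f"PyTDC {version} is installed")
```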


Design of Therapeutics Data Commons

TDC has a unique three-tiered hierarchical structure, which, to our knowledge, is the first attempt at systematically organizing machine learning for therapeutics. We organize TDC into three distinct problems. For each problem, we give a collection of learning tasks. Finally, for each task, we provide a series of datasets.

In the first tier, after observing a large set of therapeutics tasks, we categorize and abstract out three major areas (i.e., problems) where machine learning can facilitate scientific advances, namely, single-instance prediction, multi-instance prediction, and generation:

  • Single-instance prediction (single_pred): Predicting a property of an individual biomedical entity.
  • Multi-instance prediction (multi_pred): Predicting a property of multiple biomedical entities.
  • Generation (generation): Generating new, desirable biomedical entities.

[Figure: TDC hierarchy]

The second tier in the TDC structure is organized into learning tasks. Improvement on these tasks can result in numerous applications, including identifying personalized combinatorial therapies, designing novel classes of antibodies, improving disease diagnosis, and finding new cures for emerging diseases.

Finally, in the third tier of TDC, each task is instantiated via multiple datasets. For each dataset, we provide several splits of the dataset into training, validation, and test sets to simulate the type of understanding and generalization (e.g., the model's ability to generalize to entirely unseen compounds or to granularly resolve patient response to a polytherapy) needed for transition into production and clinical implementation.
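
One way to picture the three tiers is as a nested mapping from problem to learning task to datasets. The sketch below is illustrative only: `HIERARCHY` and `locate` are hypothetical names, and only the one task and dataset named in this guide (ADME and Caco2_Wang) are filled in; the other two problems are left empty rather than guessed.

```python
# Tier 1 (problem) -> tier 2 (learning task) -> tier 3 (datasets).
# Only names that appear in this guide are listed; the real collection is larger.
HIERARCHY = {
    "single_pred": {"ADME": ["Caco2_Wang"]},
    "multi_pred": {},
    "generation": {},
}


def locate(dataset):
    """Return the (problem, task) pair that contains a dataset, or None."""
    for problem, tasks in HIERARCHY.items():
        for task, datasets in tasks.items():
            if dataset in datasets:
                return problem, task
    return None


print(locate("Caco2_Wang"))
```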

[Figure: TDC problems]

Data Loaders

TDC provides a collection of workflows with intuitive, high-level APIs for both beginners and experts to create machine learning models in Python. Building off the modularized "Problem--Learning Task--Data Set" structure (see above) in TDC, we provide a three-layer API to access any learning task and dataset. This hierarchical API design allows us to easily incorporate new tasks and datasets.

[Figure: TDC API]

As a concrete example, to obtain the Caco2 dataset from the ADME learning task in the single-instance prediction problem:

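from tdc.single_pred import ADME

# Problem: single_pred, learning task: ADME, dataset: Caco2_Wang
data = ADME(name='Caco2_Wang')
df = data.get_data()        # the entire dataset
splits = data.get_split()   # training, validation, and test splits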

The variable df is a pandas DataFrame holding the entire dataset. By default, the variable splits is a dictionary whose keys are train, valid, and test and whose values are pandas DataFrames containing the drug IDs, SMILES strings, and labels. For detailed outputs, see the dataset documentation.
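
A quick sanity check after loading is to see how the rows are distributed across the splits. The sketch below is one way to do that, not part of TDC itself: `split_fractions` and `load_caco2_splits` are hypothetical helpers, and the loader assumes PyTDC is installed and the TDC server is reachable. The import is done lazily so that the helper can be used on any dictionary of sized objects.

```python
def load_caco2_splits():
    """Download Caco2_Wang and return its default splits (needs PyTDC and network access)."""
    from tdc.single_pred import ADME  # imported lazily; only needed when this is called
    data = ADME(name='Caco2_Wang')
    return data.get_split()


def split_fractions(splits):
    """Return the share of rows in each split, given a dict of sized objects."""
    total = sum(len(part) for part in splits.values())
    if total == 0:
        raise ValueError("all splits are empty")
    return {name: len(part) / total for name, part in splits.items()}
```

Calling `split_fractions(load_caco2_splits())` then reports the proportion of the dataset assigned to each split.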

Note that the user only needs to specify "Problem--Learning Task--Data Set"; TDC then automatically retrieves the processed, machine learning-ready dataset from the TDC server and generates a data object, which contains numerous utility functions that can be applied directly to the dataset.
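
Because every dataset is addressed by the same three names, the lookup can be written generically. The following is a sketch under stated assumptions rather than TDC's own interface: `module_for` and `load` are hypothetical helpers, and they assume that each problem corresponds to a module `tdc.<problem>` exposing one loader class per learning task, as the `from tdc.single_pred import ADME` import in this guide suggests.

```python
import importlib

PROBLEMS = ("single_pred", "multi_pred", "generation")


def module_for(problem):
    """Map a problem name to the module assumed to hold its task loaders."""
    if problem not in PROBLEMS:
        raise ValueError(f"unknown problem {problem!r}; expected one of {PROBLEMS}")
    return f"tdc.{problem}"


def load(problem, task, name):
    """Resolve Problem -> Learning Task -> Dataset and return the data object.

    Needs PyTDC installed and access to the TDC server.
    """
    module = importlib.import_module(module_for(problem))
    loader_cls = getattr(module, task)   # e.g. the ADME class
    return loader_cls(name=name)         # e.g. ADME(name='Caco2_Wang')
```

With PyTDC installed, `load('single_pred', 'ADME', 'Caco2_Wang')` should be equivalent to the explicit import shown earlier in this section.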


Data Functions

TDC implements a comprehensive suite of auxiliary functions frequently used in therapeutics ML. This functionality is wrapped in an easy-to-use interface. Broadly, we provide functions for the following four major categories:

  • Model Evaluation: TDC includes a series of realistic metric functions for evaluating models on its therapeutics ML tasks, so that evaluated models transfer to real-world scenarios.
  • Data Split: Real-world applications require ML models to generalize to out-of-distribution samples. TDC includes various data splits that reflect realistic generalization schemes.
  • Data Processing: As therapeutics ML covers a wide range of data modalities and involves numerous repetitive processing steps, TDC collects and provides wrappers for useful data helpers.
  • Molecule Generation Oracles: Molecular design tasks require oracle functions to measure the quality of generated entities. TDC provides oracles, each tailored to a specific goal of interest.

See the functions documentation for detailed usage of each function.
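
To make the idea of an out-of-distribution split concrete, the sketch below assigns whole groups of samples to a single split, so that no group seen in training reappears in validation or testing. It is a generic illustration, not TDC's implementation: `group_split` is a hypothetical function, and the grouping key (for molecules, something like a shared scaffold) is supplied by the caller.

```python
from collections import defaultdict


def group_split(items, group_of, frac=(0.7, 0.1, 0.2)):
    """Assign whole groups to train/valid/test so that no group spans two splits."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    # Largest groups first: common groups fill train, rare ones end up in valid/test.
    ordered = sorted(groups.values(), key=len, reverse=True)

    n = len(items)
    train_cut = round(frac[0] * n)
    valid_cut = train_cut + round(frac[1] * n)

    splits = {"train": [], "valid": [], "test": []}
    for members in ordered:
        if len(splits["train"]) + len(members) <= train_cut:
            splits["train"].extend(members)
        elif len(splits["train"]) + len(splits["valid"]) + len(members) <= valid_cut:
            splits["valid"].extend(members)
        else:
            splits["test"].extend(members)
    return splits


# Toy data: the first character of each label plays the role of the group key.
samples = ["a1", "a2", "a3", "a4", "b1", "b2", "c1", "d1", "e1", "f1"]
print(group_split(samples, group_of=lambda s: s[0]))
```

Because groups are never divided, the realized split sizes only approximate the requested fractions; that trade-off is what makes the held-out sets contain unseen groups.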


Cite Us

If you use TDC in your work, please cite:

@article{tdc,
  title={Therapeutics Data Commons: Machine Learning Datasets and Tasks for Therapeutics},
  author={Huang, Kexin and Fu, Tianfan and Gao, Wenhao and Zhao, Yue and Roohani, Yusuf and Leskovec, Jure and Coley, Connor W and Xiao, Cao and Sun, Jimeng and Zitnik, Marinka},
  journal={arXiv preprint arXiv:2102.09548},
  year={2021}
}

If you use any of the datasets, please make sure to also cite the primary data source as described in the Datasets section.


Start Exploring Therapeutics Data Commons: