News
2024.06.22 TDC-2 preprint is released!
TDC-2 Features
- 10+ new modalities: TDC-2 drastically expands the coverage of ML tasks across therapeutic pipelines and 10+ new modalities, spanning but not limited to single-cell gene expression data, clinical trial data, peptide sequence data, peptidomimetics protein-peptide interaction data regarding newly discovered ligands derived from AS-MS spectroscopy, novel 3D structural data for proteins, and cell-type-specific protein-protein interaction networks at single-cell resolution.
- Single-cell atlases and foundation model embeddings: TDC-2 introduces over 1,000 multimodal datasets, spanning approximately 85 million cells and pre-calculated embeddings from 5 state-of-the-art single-cell models via CZ CELLxGENE Census and the TDC Model Hub.
- API-First Multimodal Retrieval via TDC-2 MVC and Resource: TDC-2 expands the dataset retrieval capabilities available in TDC-1 beyond those of other leading benchmarks. The software architecture of TDC-2 was redesigned using the Model-View-Controller (MVC) design pattern. The MVC pattern supports the integration of multiple data modalities by using data mappings and views. The MVC-enabled multimodal retrieval API is powered by TDC-2’s Resource Model and a Domain-Specific Language.
- TDC-2 Domain-Specific Language: TDC-2 develops an Application-Embedded Domain-Specific Data Definition Programming Language that facilitates the integration of multiple modalities by generating data views from a mapping of various datasets and functions for transformations, integration, and multimodal enhancements while maintaining a high level of abstraction for the Resource framework.
- TDC-2 Resource Model: The Commons redesigns TDC-1’s dataset layer into a new data model dubbed the TDC-2 resource, developed under the MVC paradigm to integrate multiple modalities into the API-first model of TDC-2. We leverage CZ CELLxGENE to develop a TDC-2 Resource Model for constructing large-scale single-cell datasets that map gene expression profiles of individual cells across tissues and across healthy and disease states.
- Biomedical Knowledge Graphs and External APIs: We have developed a framework for biomedical knowledge graphs to enhance the multimodality of dataset retrieval via TDC-2’s Resource Model. Our system leverages PrimeKG to integrate 20 high-quality resources to describe 17,080 diseases with 4,050,249 relationships. Our framework also extends to external APIs, with data views currently leveraging Biopython to obtain nucleotide sequence information for a given non-coding RNA ID from NCBI, and the UniProt Consortium’s RESTful GET API to obtain amino acid sequences.
- TDC-2 Model Hub: In addition, we’ve developed a framework that allows access to predictive and foundation embedding models under diverse biological contexts via the TDC-2 Model Hub. TDC-2 releases AI-powered endpoints via The Commons' Model Hub, which enhances multimodal retrieval capabilities by providing access to protein embeddings under cell-type-specific biological contexts and model predictions for key biomedical challenges including but not limited to binary classification on SMILES strings for Ether-a-go-go-related gene blockers, blood-brain-barrier permeability, and CYP3A4 inhibition. These models can be fine-tuned under TDC-2's fine-tuning paradigm and/or used for innovative downstream tasks.
- 7 innovative multimodal ML tasks and benchmarks: TDC-2 introduces 7 novel ML tasks with fine-grained biological contexts, including contextualized drug-target identification, single-cell chemical/genetic perturbation response prediction, protein-peptide binding affinity prediction, and clinical trial outcome prediction, which introduce antigen-processing-pathway-specific, cell-type-specific, peptide-specific, and patient-specific biological contexts. TDC-2 also releases benchmarks evaluating 15+ state-of-the-art models across 5+ new learning tasks, covering diverse biological contexts and sampling approaches. Among these, TDC-2 provides the first benchmark for context-specific learning. TDC-2, to our knowledge, is also the first to introduce a protein-peptide binding interaction benchmark. TDC-2's tasks, frameworks, datasets, and models are tailored to take on some of the most pressing machine learning challenges in biomedicine, including but not limited to cell-type-specific machine learning modeling and evaluation, the inferential gap in precision medicine, negative-sampling challenges in peptidomimetics, and model generalizability across unseen cell lines and perturbations.
For more information on these and additional features, please refer to the bioRxiv preprint.
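As a rough illustration of how an MVC-style data view can combine modalities, the sketch below joins a toy interaction table with a toy sequence lookup. Every name here (`ResourceModel`, `DataView`, `Controller`, the dataset names and values) is hypothetical and chosen only for this example; none of it is the TDC-2 API or its Domain-Specific Language, which are described in the preprint.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class ResourceModel:
    """Model: holds raw datasets keyed by name (hypothetical stand-in)."""
    datasets: Dict[str, List[Row]] = field(default_factory=dict)


@dataclass
class DataView:
    """View: a mapping from one source dataset to extra derived columns."""
    source: str
    transforms: Dict[str, Callable[[Row], object]]

    def render(self, model: ResourceModel) -> List[Row]:
        rows = model.datasets[self.source]
        # Keep the original fields and add one field per transform.
        return [
            {**row, **{col: fn(row) for col, fn in self.transforms.items()}}
            for row in rows
        ]


class Controller:
    """Controller: looks up a view by name and renders it on the model."""

    def __init__(self, model: ResourceModel, views: Dict[str, DataView]):
        self.model = model
        self.views = views

    def get(self, view_name: str) -> List[Row]:
        return self.views[view_name].render(self.model)


# Toy data: an interaction table plus a sequence lookup (made-up values).
sequences = {"P1": "MKV", "P2": "GAL"}
model = ResourceModel(
    datasets={
        "interactions": [
            {"protein_id": "P1", "peptide": "ACD"},
            {"protein_id": "P2", "peptide": "EFG"},
        ]
    }
)
view = DataView(
    source="interactions",
    transforms={"protein_sequence": lambda row: sequences[row["protein_id"]]},
)
controller = Controller(model, {"interactions_with_sequence": view})
result = controller.get("interactions_with_sequence")
```

The point of the separation is that a new modality can be added by registering another view, without changing the stored datasets or the retrieval call.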
2024.06.19 TDC-2 preview was presented at MoML2024, hosted by Mila. You can see the full conference here. Our poster can also be seen in our tweet.
2023.07.10 TDC 0.4.1 is released! TDC has an exciting new task on clinical trial outcome prediction (thanks to Tianfan)! See here for more information.
2023.04.17 TDC 0.4.0 is released! We're excited to announce the release of a new interface, tdc_hf_interface, that allows users to easily access and leverage pre-trained models hosted on HuggingFace for TDC datasets and tasks. In this first batch, we've released nine pre-trained models from DeepPurpose that cover three popular ADMET datasets in the Commons. To load a pre-trained model, simply do the following:
from tdc import tdc_hf_interface
tdc_hf = tdc_hf_interface("BBB_Martins-AttentiveFP")
dp_model = tdc_hf.load_deeppurpose('./data')
tdc_hf.predict_deeppurpose(dp_model, ['CC(=O)NC1=CC=C(O)C=C1'])
The TDC-HF space is located here. Stay tuned for more exciting pre-trained models, tasks & demos!
2023.01.26 TDC 0.3.9 is released! Here are the changes:
- TDC has 9 new datasets on high-throughput screening (HTS). These assays cover a wide range of protein target classes and are carefully collated through confirmation screens to validate active compounds. See here on how to access them!
Protein Target Class | PubChem AID | Protein Target | Total # of Molecules | # of Active Molecules |
---|---|---|---|---|
GPCR | 435008 | Orexin1 Receptor | 218,158 | 233 |
GPCR | 1798 | M1 Muscarinic Receptor Agonists | 61,833 | 187 |
GPCR | 435034 | M1 Muscarinic Receptor Antagonists | 61,756 | 362 |
Ion Channel | 1843 | Potassium Ion Channel Kir2.1 | 301,493 | 172 |
Ion Channel | 2258 | KCNQ2 Potassium Channel | 302,405 | 213 |
Ion Channel | 463087 | Cav3 T-type Calcium Channels | 100,875 | 703 |
Transporter | 488997 | Choline Transporter | 302,306 | 252 |
Kinase | 2689 | Serine/Threonine Kinase 33 | 319,792 | 172 |
Enzyme | 485290 | Tyrosyl-DNA Phosphodiesterase | 341,365 | 281 |
- TDC has an additional dataset on hERG in the Tox task. See here for more info!
- TDC now follows black code style!
2022.11.03 TDC 0.3.8 is released! Here are the changes:
- TDC has a new task on structure-based drug design (SBDD) with four datasets, including PDBBind, DUD-E, and scPDB. See here on how to access them!
- To support evaluation of SBDD tasks, we also include two evaluation metrics (RMSD, Kabsch-RMSD) that compare distances between two structures. See here for more info.
- TDC has a new dataset on PAMPA (parallel artificial membrane permeability assay), a commonly used assay for evaluating drug permeability across the cellular membrane, in the ADME task. See here for more info!
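For readers unfamiliar with the two structure metrics mentioned in this release, the sketch below shows one common way to compute them with NumPy. It is not TDC's evaluator code: the function names `rmsd` and `kabsch_rmsd` and the toy coordinates are assumptions made for this example. Plain RMSD compares coordinates as given, while Kabsch-RMSD first centers both structures and finds the optimal rotation.

```python
import numpy as np


def rmsd(p, q):
    """Root-mean-square deviation between two (N, 3) coordinate arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(np.mean(np.sum((p - q) ** 2, axis=1))))


def kabsch_rmsd(p, q):
    """RMSD after optimally superimposing p onto q (Kabsch algorithm)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Remove translation by centering both point sets on their centroids.
    p_c = p - p.mean(axis=0)
    q_c = q - q.mean(axis=0)
    # Covariance matrix and its singular value decomposition.
    h = p_c.T @ q_c
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return rmsd(p_c @ rot.T, q_c)


# Toy example: b is a rotated (90 degrees about z) and translated copy of a.
theta = np.pi / 2
rz = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]])
b = a @ rz.T + np.array([5.0, -2.0, 1.0])

plain = rmsd(a, b)            # large, because b is moved and rotated
aligned = kabsch_rmsd(a, b)   # close to zero, because the shapes are identical
```

Both functions assume the two structures list the same atoms in the same order.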
2022.09.06 TDC 0.3.7 is released! Here are the changes:
- TDC has a new evaluation metric, logAUC. See here and the PR.
- TDC now supports the graphein protein 3D representation for antibody developability prediction. See the tutorial and the PR.
- The QM task is now in 3D format. See here.
- TDC has a harmonize function to deal with duplicated experimental entries in DTI. See here.
- TDC now has a dataloader for PrimeKG as an auxiliary resource. See how to access PrimeKG here.
- TDC fixed a static scikit-learn version issue for the gsk3b, jnk3, and drd2 oracles. See here for more info.
- The PPBR dataset in the ADME task now has additional species information. By default it contains only Homo sapiens entries; other species can be retrieved via a TDC function. See here for more info.
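To illustrate what harmonizing duplicated DTI entries can look like, here is a minimal standard-library sketch. It is an assumption about the general idea (collapse repeated drug-target pairs into one value), not TDC's implementation; the `harmonize` function, its `mode` values, and the toy records are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def harmonize(records, mode="mean"):
    """Collapse duplicate (drug, target) pairs into a single affinity value.

    records: iterable of (drug_id, target_id, affinity) tuples.
    mode: "mean" averages duplicates, "max" keeps the largest value.
    """
    if mode == "mean":
        aggregate = mean
    elif mode == "max":
        aggregate = max
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    grouped = defaultdict(list)
    for drug, target, affinity in records:
        grouped[(drug, target)].append(affinity)

    # Dicts preserve insertion order, so pairs come out in first-seen order.
    return [(d, t, aggregate(values)) for (d, t), values in grouped.items()]


# Toy records: the pair (D1, T1) was measured twice.
records = [("D1", "T1", 10.0), ("D1", "T1", 30.0), ("D2", "T1", 5.0)]
by_mean = harmonize(records, mode="mean")
by_max = harmonize(records, mode="max")
```

Whether the mean or an extreme value is the right choice depends on the assay and on how the label is defined, so treat the mode as a modeling decision.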
2022.02.19 TDC 0.3.6 is released! TDC has a new task on TCR-epitope binding prediction (thanks to Anna and Jannis)! See here for more information.
2022.01.23 TDC 0.3.5 is released! Here are the changes:
- TDC has an updated ChEMBL library (version 29) in MolGen! The previous version is still available. See here for more information.
- Reaction type information can be included in each split by turning on the include_reaction_type flag for USPTO-50 in RetroSyn! See here for more information.
- Fixed a bug in the cold split for higher-order (>2) multi-instance prediction tasks! (Thanks to Zoe!) See here for more information.
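For context on what a cold split does, the sketch below holds out whole entities so that no entity in the test set appears in training. It is a simplified single-column illustration under our own assumptions, not TDC's split code: the `cold_split` function, its parameters, and the toy rows are hypothetical, and the multi-instance case (several entity columns held out at once) is not covered.

```python
import random


def cold_split(rows, key, frac=(0.7, 0.1, 0.2), seed=42):
    """Split rows so that each value of `key` lands in exactly one subset."""
    # Sort first so the shuffle is reproducible for a given seed.
    entities = sorted({row[key] for row in rows})
    random.Random(seed).shuffle(entities)

    n_train = int(frac[0] * len(entities))
    n_valid = int(frac[1] * len(entities))
    train_entities = set(entities[:n_train])
    valid_entities = set(entities[n_train:n_train + n_valid])

    split = {"train": [], "valid": [], "test": []}
    for row in rows:
        if row[key] in train_entities:
            split["train"].append(row)
        elif row[key] in valid_entities:
            split["valid"].append(row)
        else:
            # Everything not assigned to train or valid goes to test.
            split["test"].append(row)
    return split


# Toy data: 10 drugs, each paired with 3 targets.
rows = [
    {"drug": f"D{i}", "target": f"T{j}", "y": float(i + j)}
    for i in range(10)
    for j in range(3)
]
split = cold_split(rows, key="drug")
train_drugs = {r["drug"] for r in split["train"]}
valid_drugs = {r["drug"] for r in split["valid"]}
test_drugs = {r["drug"] for r in split["test"]}
```

Note that the fractions apply to entities, not rows, so subset sizes in rows can differ from the nominal fractions when entities have uneven numbers of entries.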
2021.12.28 TDC 0.3.4 is released! Bug fixes for the docking oracles and the KL divergence measure.
2021.11.25 TDC 0.3.3 is released! Added extended support for cold split in multi-instance prediction tasks; see this issue!
2021.10.17 TDC 0.3.2 is released! We have added support for harmonizing identical DTIs with different affinities (KIBA and DAVIS are updated accordingly; see this issue), support for label name retrieval for TWOSIDES (this issue), and gene symbol info for GDSC (this issue).
2021.09.04 TDC 0.3.0 is released! We have greatly restructured the code to be contributor-friendly while keeping most interfaces the same. We have also released the documentation for the TDC package here.
2021.05.30 TDC updates to 0.2.0, major changes:
- TDC has a new molecule generation benchmark on docking scores! See here for more information.
2021.03.24 TDC updates to 0.1.9, major changes:
- TDC now supports molecule filters! See here for more information.
2021.03.17 TDC updates to 0.1.8, major changes:
- The leaderboard is reformulated, and we invite submissions for each individual benchmark! See here for more information.
2021.02.26 TDC updates to 0.1.7, major changes:
- Streamlined leaderboard programming framework! See here for more information.
- Label log transformation is supported. See here for more information.
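As background on why a label log transformation is useful, binding measurements such as Kd or IC50 often span many orders of magnitude, and a negative log scale compresses them into a range that is easier to regress on. The sketch below shows one common convention (nanomolar to a "p" scale). It is an assumption for illustration; the function name `nm_to_p` is hypothetical, and the exact formula TDC applies may differ.

```python
import math


def nm_to_p(value_nm):
    """Convert a concentration in nM to a p-scale value, -log10(molar).

    Since 1 nM = 1e-9 M, -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    """
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    return 9.0 - math.log10(value_nm)


# Toy labels in nM spanning several orders of magnitude.
labels_nm = [1.0, 1000.0, 10000.0]
labels_p = [nm_to_p(v) for v in labels_nm]
```

On this scale, larger values mean tighter binding, which reverses the ordering of the raw labels.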
2021.02.18 TDC just released the white paper on arXiv! Here is the link to the paper.
2021.02.04 TDC updates to 0.1.6, major changes:
- New Leaderboard! Just released the second leaderboard, on drug combination response prediction! See here for usage.
2021.01.16 TDC updates to 0.1.5, major changes:
- New Oracles! Added four realistic oracles from docking scores and synthetic accessibility scores! See here for usage.
2021.01.09 TDC updates to 0.1.4, major changes:
- New Function! Added a data processing helper to map among ~15 molecular formats in 2 lines of code (for 2D: from SMILES/SELFIES to SELFIES/SMILES, Graph2D, PyG, DGL, ECFP2-6, MACCS, Daylight, RDKit2D, Morgan, PubChem; for 3D: from XYZ and SDF files to Graph3D and Coulomb Matrix). See here for usage.
- Quality Check! Canonicalized SMILES on DTI datasets, with Drug and Target IDs added. See DTI.
2020.12.30 TDC updates to 0.1.3, major changes:
- New Dataset! Added a new therapeutic task, CRISPR Repair Outcome Prediction! See CRISPROutcome.
- New Function! Added a data processing helper to map SMILES strings to popular cheminformatics fingerprints (ECFP2, ECFP4, ECFP6, MACCS, Daylight-type, RDKit2D, Morgan, PubChem)! See here for usage.
2020.12.24 TDC updates to 0.1.2, major changes:
- Leaderboard Release! TDC's first leaderboard, on ADMET prediction, is released. You can find the leaderboard guide here, where we provide a BenchmarkGroup class for rapid model building on leaderboard tasks. The ADMET leaderboard is here.
2020.12.19 TDC updates to 0.1.1, major changes:
- Quality Check and New Datasets! We replaced the VD, Half Life, and Clearance datasets in ADME with versions from new, higher-quality sources. We also added LD50 to Tox.
2020.12.15 TDC updates to 0.1.0, major changes:
- Five New Datasets! Added CYP2C9/2D6/3A4 Substrate for ADME, Carcinogens for Tox, and NCI-60 for DrugSyn.
- Quality Check. We canonicalized all SMILES and removed those that return errors in the ADME, Tox, and HTS datasets.
2020.11.30 TDC updates to 0.0.8, major changes:
- Five New Datasets! Added hERG, DILI (Drug Induced Liver Injury), Skin Reaction, and Ames Mutagenicity for Tox, and PPBR from AstraZeneca for ADME.
- Distribution Learning Metrics Moved to Evaluators. See here for the updated usage.
- Meta Oracles. We included a helper function where you can specify your own set of molecules for Rediscovery, Similarity, Medians, and Isomers. See an example usage here.
- Tutorials. We have provided various tutorials to help you start using TDC. Click here.