Retrosynthesis Prediction Task Overview
Definition: Retrosynthesis is the process of finding a set of reactants that can synthesize a target molecule (the product), a fundamental task in drug manufacturing. The target is recursively transformed into simpler precursor molecules until commercially available "starting" molecules are identified. Each data sample contains exactly one product molecule and one or more reactant molecules. Retrosynthesis prediction can be seen as the reverse of reaction outcome prediction.
Impact: Retrosynthesis planning helps chemists design synthetic routes to target molecules. Computational retrosynthetic analysis tools can assist chemists in designing routes to novel molecules, and machine-learning-based methods have the potential to substantially reduce the time and cost involved.
Generalization: The model is expected to accurately generate reactant sets for novel drug candidates with distinct structures from the training set across reaction types with varying reaction conditions.
Product: Small-molecule.
Pipeline: Manufacturing - Synthesis planning.
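The recursive procedure in the definition above can be illustrated with a minimal sketch. It is not a TDC API: the one-step lookup table `ONE_STEP`, the catalog `PURCHASABLE`, and the function `plan_route` are hypothetical names, and the toy reactions are illustrative rather than real model predictions. A learned model would take the place of the lookup table.

```python
# Toy one-step "model": maps a product SMILES to one hypothetical reactant set.
# These entries are illustrative only, not real predictions.
ONE_STEP = {
    "CC(=O)OCC": ["CC(=O)O", "CCO"],  # ethyl acetate <= acetic acid + ethanol
    "CC(=O)O": ["CC=O"],              # acetic acid <= acetaldehyde (toy step)
}

# Hypothetical catalog of commercially available starting molecules.
PURCHASABLE = {"CCO", "CC=O"}


def plan_route(target, max_depth=5):
    """Recursively expand `target` until every leaf is purchasable.

    Returns a list of (product, reactants) steps, or None if no route is found.
    """
    if target in PURCHASABLE:
        return []  # already a starting molecule, nothing to synthesize
    if max_depth == 0 or target not in ONE_STEP:
        return None  # dead end: depth exhausted or no known disconnection
    reactants = ONE_STEP[target]
    steps = [(target, reactants)]
    for reactant in reactants:
        sub_route = plan_route(reactant, max_depth - 1)
        if sub_route is None:
            return None
        steps.extend(sub_route)
    return steps


route = plan_route("CC(=O)OCC")
for product, reactants in route:
    print(product, "<=", " + ".join(reactants))
```

The datasets below cover only the single-step part of this process: each sample pairs one product with its direct reactant set.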
USPTO-50K
Dataset Description: USPTO (United States Patent and Trademark Office) 50K consists of 50K extracted atom-mapped reactions with 10 reaction types.
Task Description: Given the product X, generate the reactant set Y.
Dataset Statistics: 50,036 reactions.
Dataset Split: Random Split
from tdc.generation import RetroSyn
# Load USPTO-50K and split it (random split).
data = RetroSyn(name='USPTO-50K')
split = data.get_split()
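As a rough illustration of what a random split does, the sketch below shuffles a list of samples reproducibly and cuts it into three parts. This is not TDC's implementation: the function `random_split`, the fractions, the seed, and the key names `train`, `valid`, and `test` are assumptions chosen for the example.

```python
import random


def random_split(items, frac=(0.7, 0.1, 0.2), seed=42):
    """Shuffle `items` reproducibly and cut them into train/valid/test lists."""
    items = list(items)
    rng = random.Random(seed)  # local generator, so global state is untouched
    rng.shuffle(items)
    n_train = round(frac[0] * len(items))
    n_valid = round(frac[1] * len(items))
    return {
        "train": items[:n_train],
        "valid": items[n_train:n_train + n_valid],
        "test": items[n_train + n_valid:],  # remainder goes to the test set
    }


# Placeholder sample identifiers standing in for reactions.
toy = [f"reaction_{i}" for i in range(20)]
parts = random_split(toy)
print({name: len(subset) for name, subset in parts.items()})
```

Because the assignment is random, reactions in the test set may closely resemble training reactions, so a random split measures in-distribution performance rather than generalization to structurally distinct molecules.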
Note: To get the reaction type of each reaction, you can run:
from tdc.utils import get_reaction_type
get_reaction_type('USPTO-50K')
Note: Starting from version 0.3.5, you can also include the reaction type in the dataframe of each split by setting include_reaction_type to True:
split = data.get_split(include_reaction_type = True)
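To make the task format concrete (product X in, reactant set Y out), here is a small sketch of how one sample might be handled. It assumes that the product and reactants are SMILES strings stored under the keys `input` and `output`, and that multiple reactants are joined with `.`. These names and the helper `to_reactant_set` are assumptions for illustration and should be checked against the actual dataframe.

```python
def to_reactant_set(reactants_smiles):
    """Split a dot-separated SMILES string into an unordered set of molecules."""
    return frozenset(s for s in reactants_smiles.split(".") if s)


# Hypothetical sample: ethyl acetate as product, acetic acid and ethanol as reactants.
sample = {"input": "CC(=O)OCC", "output": "CC(=O)O.CCO"}

truth = to_reactant_set(sample["output"])
prediction = to_reactant_set("CCO.CC(=O)O")  # same molecules, different order

# Reactants form a set, so the order in which they are written does not matter.
print(prediction == truth)
```

Plain string comparison only works when both sides use the same canonical SMILES form. In practice the strings would usually be canonicalized with a cheminformatics toolkit before comparison.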
References:
Dataset License: CC0.
USPTO
Dataset Description: The full USPTO (United States Patent and Trademark Office) retrosynthesis dataset.
Task Description: Given the product X, generate the reactant set Y.
Dataset Statistics: 1,939,253 reactions.
Dataset Split: Random Split
from tdc.generation import RetroSyn
# Load the full USPTO dataset and split it (random split).
data = RetroSyn(name='USPTO')
split = data.get_split()
References:
[1] Daniel Lowe. Chemical reactions from US patents (1976-Sep2016).
Dataset License: CC0.